
arxiv: 2201.11990 · v3 · pith:OVM55B4Onew · submitted 2022-01-28 · 💻 cs.CL

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

Pith reviewed 2026-05-24 12:09 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · transformer · DeepSpeed · Megatron · 3D parallelism · zero-shot learning · few-shot learning · natural language generation

The pith

A 530 billion parameter transformer model trained via DeepSpeed and Megatron sets new state-of-the-art results on zero-, one-, and few-shot NLP benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes the end-to-end training of Megatron-Turing NLG 530B, the largest monolithic transformer language model reported at the time. It explains the 3D parallelism approach that combines data, model, and pipeline parallelism to fit the model on available hardware, along with the construction and curation of the training corpus. The resulting model delivers higher accuracies than prior systems in zero-shot, one-shot, and few-shot settings across multiple standard NLP tasks. A reader would care because the work shows how hardware-software co-design and data choices together make extreme scale practical and effective for general-purpose language generation.
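
A back-of-envelope estimate shows why a model of this size has to be split across many devices. The sketch below assumes mixed-precision Adam training at 16 bytes per parameter and a hypothetical 80 GB accelerator; neither figure is taken from the paper, and activation memory is ignored entirely.

```python
# Rough memory estimate for a dense 530B-parameter model.
# Assumptions (illustrative, not from the paper):
#   - fp16 weights: 2 bytes per parameter
#   - mixed-precision Adam training state: 16 bytes per parameter
#     (2 fp16 weights + 2 fp16 gradients + 12 for fp32 weights, momentum, variance)
#   - a hypothetical accelerator with 80 GB of memory
#   - activations and communication buffers are ignored

PARAMS = 530 * 10**9
BYTES_FP16_WEIGHTS = 2
BYTES_TRAINING_STATE = 16
GPU_MEMORY_BYTES = 80 * 10**9


def terabytes(num_bytes: int) -> float:
    """Convert bytes to decimal terabytes."""
    return num_bytes / 1e12


weights_tb = terabytes(PARAMS * BYTES_FP16_WEIGHTS)
training_state_tb = terabytes(PARAMS * BYTES_TRAINING_STATE)

# Ceiling division: fewest devices whose combined memory holds the training state.
total_state_bytes = PARAMS * BYTES_TRAINING_STATE
min_gpus = -(-total_state_bytes // GPU_MEMORY_BYTES)

print(f"fp16 weights alone: {weights_tb:.2f} TB")
print(f"weights + gradients + optimizer state: {training_state_tb:.2f} TB")
print(f"lower bound on devices for one model replica: {min_gpus}")
```

Under these assumptions a single replica cannot fit on one device, or on one small multi-GPU server, which is the gap that splitting the model across devices is meant to close.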

Core claim

We present details on the training of the largest monolithic transformer based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. Using DeepSpeed and Megatron, we employ a 3D parallelism methodology to enable training at this scale. The design of the training corpus and data curation techniques, which we believe is a key ingredient to the success of the model, allow MT-NLG to achieve superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establish new state-of-the-art results.

What carries the argument

3D parallelism (data, model, and pipeline) implemented in DeepSpeed and Megatron that distributes the 530 billion parameter transformer across hardware while maintaining training stability.
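
One way to picture the three axes is as a grid in which every worker has a data, pipeline, and tensor (model) coordinate. The sketch below is a minimal illustration with hypothetical helper names and toy grid sizes; the rank ordering is an assumption chosen for clarity, and the actual layouts used by DeepSpeed and Megatron may differ.

```python
from typing import List, NamedTuple


class Coord(NamedTuple):
    data: int    # which model replica the worker belongs to
    pipe: int    # which pipeline stage (block of layers) it holds
    tensor: int  # which slice of each layer's weights it holds


def rank_to_coord(rank: int, tensor_size: int, pipe_size: int, data_size: int) -> Coord:
    """Map a flat rank to grid coordinates, with the tensor axis varying fastest.

    Assumed ordering: tensor-parallel peers get adjacent ranks so that the
    most communication-heavy group can sit on the same server.
    """
    world = tensor_size * pipe_size * data_size
    if not 0 <= rank < world:
        raise ValueError(f"rank {rank} outside world of size {world}")
    tensor = rank % tensor_size
    pipe = (rank // tensor_size) % pipe_size
    data = rank // (tensor_size * pipe_size)
    return Coord(data=data, pipe=pipe, tensor=tensor)


def tensor_group(rank: int, tensor_size: int, pipe_size: int, data_size: int) -> List[int]:
    """Ranks that share this worker's replica and stage, differing only in tensor slice."""
    coord = rank_to_coord(rank, tensor_size, pipe_size, data_size)
    base = (coord.data * pipe_size + coord.pipe) * tensor_size
    return list(range(base, base + tensor_size))


if __name__ == "__main__":
    # Toy grid: 2-way tensor x 3-way pipeline x 2-way data = 12 workers.
    for r in range(12):
        print(r, rank_to_coord(r, tensor_size=2, pipe_size=3, data_size=2))
```

In this toy grid, one model replica spans `tensor_size * pipe_size` workers, and the data axis multiplies that replica count to raise throughput.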

If this is right

  • The infrastructure details enable training of monolithic models at hundreds of billions of parameters.
  • Data curation techniques directly contribute to the observed generalization in zero- and few-shot regimes.
  • MT-NLG exhibits new properties in natural language generation that prior smaller models did not display.
  • The same training stack can be reused to push model size further while retaining benchmark gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar parallelism and curation patterns could be applied to multimodal models that combine text with images or code.
  • The reported scaling behavior suggests that further increases in parameter count may continue to improve few-shot performance without architectural changes.
  • Open release of the exact corpus composition would allow independent verification of the data-curation hypothesis.

Load-bearing premise

The design of the training corpus and the data curation techniques are a key ingredient to the success of the model.

What would settle it

A controlled replication that trains an identical 530B model on the same hardware and code but with a standard public corpus lacking the described curation steps, then measures whether zero- and few-shot benchmark scores fall below the reported levels.

Figures

Figures reproduced from arXiv: 2201.11990 by Brandon Norick, Bryan Catanzaro, Elton Zhang, George Zerveas, Jared Casper, Julie Bernauer, Michael Houston, Mohammad Shoeybi, Mostofa Patwary, Patrick LeGresley, Rewon Child, Reza Yazdani Aminabadi, Samyam Rajbhandari, Saurabh Tiwary, Shaden Smith, Shrimai Prabhumoye, Vijay Korthikanti, Xia Song, Yuxiong He, Zhun Liu.

Figure 1: Trend of sizes of state-of-the-art NLP models with time.
Figure 2: Validation loss of MT-NLG.
Figure 3: The 100 most common words associated with male and female templates, ordered from most ...
Figure 4: Positive and Negative sentiment scores for each ethnicity
Figure 5: Natural Language Inference accuracy on the HANS dataset, as a function of the number of shots

Original abstract

Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe is a key ingredient to the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG. We demonstrate that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results. We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper describes the joint Microsoft-NVIDIA effort to train Megatron-Turing NLG 530B (MT-NLG), a 530-billion-parameter monolithic transformer language model. It details the infrastructure and 3D parallelism techniques implemented with DeepSpeed and Megatron, the training process, the design and curation of the training corpus, and reports that MT-NLG achieves superior zero-, one-, and few-shot accuracies on multiple NLP benchmarks, establishing new state-of-the-art results.

Significance. If the performance claims hold under comparable evaluation conditions, the work provides a valuable engineering record of scaling transformer training to 530B parameters. The explicit treatment of 3D parallelism, data curation practices, and infrastructure choices offers concrete guidance for future large-scale training efforts. The paper also surfaces observations about emergent properties of the model.

major comments (2)
  1. [Evaluation] Evaluation section: the manuscript asserts new state-of-the-art zero-/one-/few-shot results on several NLP benchmarks, yet supplies no benchmark numbers, baselines, or statistical details in the abstract and does not reference a fixed evaluation harness. This absence prevents verification that reported margins are protocol-independent rather than arising from prompt or decoding differences.
  2. [Evaluation] Evaluation section: the paper does not provide the exact prompt templates, number of shots, or decoding settings used for each benchmark where new SOTA is claimed. Without these, it is impossible to confirm that the superiority is attributable to model scale or data rather than evaluation-protocol variations relative to prior work.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one or two concrete benchmark numbers and the corresponding prior SOTA values.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive comments on the evaluation section. We agree that greater transparency in benchmark reporting, baselines, and protocol details will improve verifiability. We will revise the manuscript to address these points while preserving the paper's focus on training infrastructure and data curation.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the manuscript asserts new state-of-the-art zero-/one-/few-shot results on several NLP benchmarks, yet supplies no benchmark numbers, baselines, or statistical details in the abstract and does not reference a fixed evaluation harness. This absence prevents verification that reported margins are protocol-independent rather than arising from prompt or decoding differences.

    Authors: We acknowledge that the abstract contains only a high-level claim without numerical results or harness details, which is typical for length-constrained abstracts but can reduce immediate verifiability. The main Evaluation section does include comparative tables against prior models; however, to strengthen the paper we will (1) add a concise summary of key benchmark scores and baselines to the abstract where space permits, (2) explicitly name the evaluation harness and any custom adaptations in the Evaluation section, and (3) include error bars or statistical notes where multiple runs were performed. These changes will be made in the revised manuscript. revision: yes

  2. Referee: [Evaluation] Evaluation section: the paper does not provide the exact prompt templates, number of shots, or decoding settings used for each benchmark where new SOTA is claimed. Without these, it is impossible to confirm that the superiority is attributable to model scale or data rather than evaluation-protocol variations relative to prior work.

    Authors: We agree that reproducibility of the reported SOTA claims requires the precise prompts, shot counts, and decoding parameters. The current manuscript references standard few-shot setups for the cited benchmarks but does not reproduce the templates. In the revision we will add a dedicated appendix (or subsection) that lists, for every benchmark where a new SOTA is claimed: the exact prompt template, number of shots, decoding strategy (e.g., greedy, nucleus sampling parameters), and any post-processing steps. This will allow direct comparison with prior work and confirm that gains are not protocol artifacts. revision: yes
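
The protocol details requested above (prompt template, shot count, decoding strategy, post-processing) could be recorded per benchmark in a small structured form. The sketch below is a hypothetical illustration: the `EvalProtocol` class, its field names, the template, and the example values are all assumptions and do not reflect the harness or prompts used in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical per-benchmark record of everything needed to rerun an evaluation."""
    benchmark: str
    prompt_template: str          # must contain {question} and {answer}
    num_shots: int                # 0 for zero-shot, 1 for one-shot, etc.
    decoding: str                 # e.g. "greedy"
    post_processing: str = "none"


def build_prompt(protocol: EvalProtocol, shots: List[Tuple[str, str]], question: str) -> str:
    """Render num_shots solved examples followed by the unanswered query."""
    if len(shots) < protocol.num_shots:
        raise ValueError("not enough solved examples for the requested shot count")
    parts = [
        protocol.prompt_template.format(question=q, answer=a)
        for q, a in shots[: protocol.num_shots]
    ]
    # The query uses the same template with the answer left blank.
    parts.append(protocol.prompt_template.format(question=question, answer="").rstrip())
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Illustrative values only.
    protocol = EvalProtocol(
        benchmark="example-qa",
        prompt_template="Question: {question}\nAnswer: {answer}",
        num_shots=1,
        decoding="greedy",
    )
    print(build_prompt(protocol, [("2+2?", "4")], "3+3?"))
```

Publishing one such record per reported score would let a reader rebuild the exact model input and separate protocol effects from model effects.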

Circularity Check

0 steps flagged

Empirical training report with no derivation chain

The manuscript is an engineering report detailing hardware/software infrastructure (DeepSpeed + Megatron 3D parallelism), training corpus construction, data curation, and measured benchmark accuracies for the 530B model. No equations, fitted parameters, or predictions are presented that reduce by construction to the paper's own inputs. Evaluation results are reported as direct empirical outcomes rather than derived quantities. No self-citation load-bearing steps, ansatz smuggling, or uniqueness theorems appear in the derivation chain. The paper is self-contained as a factual account of a training run.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard assumptions of the transformer architecture and the effectiveness of existing parallelism libraries; no new entities or free parameters are introduced in the abstract.

axioms (2)
  • domain assumption Transformer-based language models scale effectively with parameter count and data quality
    Implicit in the decision to train at 530B parameters and to emphasize data curation.
  • domain assumption 3D parallelism from DeepSpeed and Megatron can be applied without fundamental bottlenecks at this scale
    Stated as the methodology used to enable training.

pith-pipeline@v0.9.0 · 5858 in / 1217 out tokens · 24213 ms · 2026-05-24T12:09:00.539751+00:00 · methodology

Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    cs.CL 2023-04 accept novelty 8.0

    Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

  2. Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

    cs.DC 2026-04 unverdicted novelty 7.0

    Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

  3. Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

    cs.CL 2024-12 unverdicted novelty 7.0

    o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.

  4. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  5. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    cs.CL 2022-11 unverdicted novelty 7.0

    PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

  6. Large Language Models are Zero-Shot Reasoners

    cs.CL 2022-05 accept novelty 7.0

    Adding the fixed prompt 'Let's think step by step' enables large language models to achieve substantial zero-shot gains on arithmetic, symbolic, and logical reasoning benchmarks without any task-specific examples.

  7. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  8. M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

    cs.LG 2026-03 unverdicted novelty 6.0

    M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.

  9. veScale-FSDP: Flexible and High-Performance FSDP at Scale

    cs.DC 2026-02 unverdicted novelty 6.0

    veScale-FSDP uses RaggedShard and structure-aware planning to support block-wise quantization and non-element-wise optimizers while delivering 5-66% higher throughput and 16-30% lower memory than prior FSDP systems at...

  10. Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection

    cs.DC 2025-08 unverdicted novelty 6.0

    Chameleon provides adaptive fault tolerance for distributed training by real-time selection of optimal recovery policies via a unified performance model, demonstrated with low overhead on a 32-card cluster.

  11. Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

    cs.AI 2025-07 conditional novelty 6.0

    Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

  12. MiniMax-01: Scaling Foundation Models with Lightning Attention

    cs.CL 2025-01 unverdicted novelty 6.0

    MiniMax-01 models match GPT-4o and Claude-3.5-Sonnet performance while providing 20-32 times longer context windows through lightning attention and MoE scaling.

  13. The Falcon Series of Open Language Models

    cs.CL 2023-11 conditional novelty 6.0

    Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

  14. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  15. Scaling Data-Constrained Language Models

    cs.CL 2023-05 conditional novelty 6.0

    Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

  16. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

    cs.CL 2023-05 conditional novelty 6.0

    Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.

  17. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    cs.CV 2023-04 conditional novelty 6.0

    MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...

  18. BloombergGPT: A Large Language Model for Finance

    cs.LG 2023-03 conditional novelty 6.0

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  19. Language Models can Solve Computer Tasks

    cs.CL 2023-03 accept novelty 6.0

    Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.

  20. FP8 Formats for Deep Learning

    cs.LG 2022-09 unverdicted novelty 6.0

    FP8 formats E4M3 and E5M2 match 16-bit training accuracy on CNNs, RNNs, and Transformers up to 175B parameters without hyperparameter changes.

  21. Atlas: Few-shot Learning with Retrieval Augmented Language Models

    cs.CL 2022-08 unverdicted novelty 6.0

    Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.

  22. Efficient Training of Language Models to Fill in the Middle

    cs.CL 2022-07 unverdicted novelty 6.0

    Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.

  23. MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

    cs.CL 2022-05 unverdicted novelty 6.0

    MRKL is a modular neuro-symbolic architecture that integrates LLMs with external knowledge and discrete reasoning to overcome limitations of pure neural language models.

  24. GPT-NeoX-20B: An Open-Source Autoregressive Language Model

    cs.CL 2022-04 accept novelty 6.0

    GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.

  25. PaLM: Scaling Language Modeling with Pathways

    cs.CL 2022-04 accept novelty 6.0

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  26. Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference

    cs.DC 2026-05 unverdicted novelty 5.0

    Charon is a unified modular simulator that predicts LLM training and inference performance with under 5.35% error and identifies throughput improvements over baselines in a real deployment case.

  27. Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference

    cs.DC 2026-05 unverdicted novelty 5.0

    Charon is a unified fine-grained simulator that predicts LLM performance with under 5.35% error overall and under 3.74% for large-scale training, and it found a better inference configuration than an engineering baseline.

  28. Transforming the Use of Earth Observation Data: Exascale Training of a Generative Compression Model with Historical Priors for up to 10,000x Data Reduction

    cs.DC 2026-05 unverdicted novelty 5.0

    A generative compression model using historical priors for Earth observation data achieves up to 10,000x reduction after exascale training on an Armv9 supercomputer.

  29. TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

    cs.DC 2026-04 unverdicted novelty 5.0

    TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.

  30. SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention

    cs.LG 2026-04 unverdicted novelty 5.0

    SparseBalance dynamically adjusts sparsity and batches workloads to load-balance sparse attention training, delivering up to 1.33x speedup and 0.46% better long-context performance on LongBench.

  31. SEDD: Scalable and Efficient Dataset Deduplication with GPUs

    cs.CL 2025-01 unverdicted novelty 5.0

    SEDD delivers a distributed GPU deduplication system that reports up to 158x speedup over CPU baselines and 7.8x over NeMo Curator on 30M documents while preserving MinHash fidelity above 0.95 Jaccard.

  32. MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    cs.CV 2023-10 unverdicted novelty 5.0

    MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.

  33. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  34. RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

    cs.LG 2023-04 unverdicted novelty 5.0

    RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.

  35. Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips

    cs.DC 2026-05 unverdicted novelty 4.0

    On Grace Hopper superchips, energy efficiency during multimodal training is governed by data movement and overlap rather than compute utilization, and runtime-optimal configurations are not always energy-optimal.

  36. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  37. Phoenix-VL 1.5 Medium Technical Report

    cs.CL 2026-05 unverdicted novelty 3.0

    Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying comp...

  38. A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models

    cs.DC 2026-05 unverdicted novelty 3.0

    A combined parallelism recipe on SuperMUC-NG Phase 2 delivers 10% of theoretical peak throughput for 175B models plus 93% weak and 82% strong scaling efficiency on 128 nodes using unmodified public software.

  39. Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector

    cs.CL 2025-09 unverdicted novelty 3.0

    Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.

  40. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  41. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

  42. A Comprehensive Overview of Large Language Models

    cs.CL 2023-07 unverdicted novelty 2.0

    A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 41 Pith papers · 17 internal anchors

  1. [1]

    https://www.nvidia.com/en-us/data-center/a100/

    NVIDIA A100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/a100/

  2. [2]

    https://www.top500.org/system/179842/

    NVIDIA Selene Supercomputer. https://www.top500.org/system/179842/

  3. [3]

    https://www.nvidia.com/en-us/data-center/nvlink/

    NVLink and NVSwitch. https://www.nvidia.com/en-us/data-center/nvlink/

  4. [4]

    https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

    Turing-NLG: A 17-billion-parameter language model by Microsoft. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

  5. [5]

    https://wudaoai.cn/home

    Wu Dao 2.0 Large-scale Pretrained Model. https://wudaoai.cn/home

  6. [6]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In AAAI, 2020

  7. [7]

    Man is to computer programmer as woman is to homemaker? Debiasing word embeddings

    Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In NIPS, 2016

  8. [8]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

  9. [9]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...

  10. [10]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ...

  11. [11]

    langdetect, 2021

    Michal Danilk. langdetect, 2021. Version 1.0.9

  12. [12]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019

  13. [13]

    Language and Gender

    Penelope Eckert and Sally McConnell-Ginet. Language and Gender. Cambridge University Press, 2003

  14. [14]

    Improving gender fairness of pre-trained language models without catastrophic forgetting

    Zahra Fatemi, Chen Xing, Wenhao Liu, and Caiming Xiong. Improving gender fairness of pre-trained language models without catastrophic forgetting. arXiv preprint arXiv:2110.05367, 2021

  15. [15]

    William Fedus, Barret Zoph, and Noam M. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. ArXiv, abs/2101.03961, 2021

  16. [16]

    Lyn Frazier and Janet D. Fodor. The sausage machine: A new two-stage parsing model. Cognition, 6(4):291–325, 1978. Place: Netherlands Publisher: Elsevier Science

  17. [17]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  18. [18]

    A framework for few-shot language model evaluation, September 2021

    Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021

  19. [19]

    RealToxicityPrompts: Evaluating neural toxic degeneration in language models

    Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, 2020

  20. [20]

    Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online, July 2020. Association for Computational Linguistics

  21. [21]

    Annotation artifacts in natural language inference data

    Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In Marilyn A. Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologie...

  22. [22]

    Pretrained transformers improve out-of-distribution robustness

    Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. Pretrained transformers improve out-of-distribution robustness. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020,...

  23. [23]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32:103–112, 2019

  24. [24]

    Improving machine reading comprehension with single-choice decision and transfer learning

    Yufan Jiang, Shuangzhi Wu, Jing Gong, Yahui Cheng, Peng Meng, Weiliang Lin, Zhibo Chen, and Mu Li. Improving machine reading comprehension with single-choice decision and transfer learning. ArXiv, abs/2011.03292, 2020

  25. [25]

    TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, 2017

  26. [26]

    Exploring the Limits of Language Modeling

    Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam M. Shazeer, and Yonghui Wu. Exploring the limits of language modeling. ArXiv, abs/1602.02410, 2016

  27. [27]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  28. [28]

    Gedi: Generative discriminator guided sequence generation, 2021

    Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Rajani. Gedi: Generative discriminator guided sequence generation, 2021

  29. [29]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. 2017 IEEE International Conference on Computer Vision (ICCV), pages 706–715, 2017

  30. [30]

    Natural Questions: A Benchmark for Question Answering Research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural Questions: A Benchmark for Question Answering Research. Tran...

  31. [31]

    RACE: Large-scale ReAding comprehension dataset from examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics

  32. [32]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yan-Ping Huang, Maxim Krikun, Noam M. Shazeer, and Z. Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. ArXiv, abs/2006.16668, 2021

  33. [33]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021

  34. [34]

    Jurassic-1: Technical details and evaluation

    Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation.

  35. [35]

    M6: A Chinese multimodal pretrainer

    Junyang Lin, Rui Men, An Yang, Chan Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, J. Zhang, Jianwei Zhang, Xu Zou, Zhikang Li, Xiao Qing Deng, Jie Liu, Jinbao Xue, Huiling Zhou, Jianxin Ma, Jin Yu, Yonghong Li, Wei Lin, Jingren Zhou, Jie Tang, and Hongxia Yang. M6: A Chinese multimodal pretrainer. ArXiv, abs/2103.00823, 2021

  36. [36]

    M6-10t: A sharing-delinking paradigm for efficient multi-trillion parameter pretraining

    Junyang Lin, An Yang, Jinze Bai, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Yong Li, Wei Lin, Jingren Zhou, and Hongxia Yang. M6-10t: A sharing-delinking paradigm for efficient multi-trillion parameter pretraining. 2021

  37. [37]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

  38. [38]

    Unicorn on rainbow: A universal commonsense reasoning model on a new multitask benchmark

    Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Unicorn on rainbow: A universal commonsense reasoning model on a new multitask benchmark. In AAAI, 2021

  39. [39]

    Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings

    Thomas Manzini, Lim Yao Chong, Alan W Black, and Yulia Tsvetkov. Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pa...

  40. [40]

    Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference

    Tom McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy, July 2019. Association for Computational Linguistics

  41. [41]

    Mixed Precision Training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017

  42. [42]

    Pipedream: generalized pipeline parallelism for dnn training

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019

  43. [43]

    Efficient large-scale language model training on gpu clusters using megatron-lm

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. ArXiv, abs/2104.04473, 2021

  44. [44]

    Mitigating harm in language models with conditional-likelihood filtration

    Helen Ngo, Cooper Raterink, João GM Araújo, Ivan Zhang, Carol Chen, Adrien Morisot, and Nicholas Frosst. Mitigating harm in language models with conditional-likelihood filtration. arXiv preprint arXiv:2108.07790, 2021

  45. [45]

    Transformers without tears: Improving the normalization of self-attention

    Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. CoRR, abs/1910.05895, 2019

  46. [46]

    Adversarial nli: A new benchmark for natural language understanding

    Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding. ArXiv, abs/1910.14599, 2020.

  47. [47]

    Adversarial NLI: A new benchmark for natural language understanding

    Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, ...

  48. [48]

    Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures

    Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, and Caroline Iliadi, editors, 7th Workshop on the Challenges in the Management of...

  49. [49]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

  50. [50]

    Wic: the word-in-context dataset for evaluating context-sensitive meaning representations

    Mohammad Taher Pilehvar and José Camacho-Collados. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. In NAACL, 2019

  51. [51]

    Sentiwordnet, 2021

    Christopher Potts. Sentiwordnet, 2021

  52. [52]

    Language models are unsupervised multitask learners

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

  53. [53]

    Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, ...

  54. [54]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, et al. Exploring the Limits of Transfer Learning with a Unified Text-to- Text Transformer. ArXiv, abs/1910.10683, 2019

  55. [55]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.

  56. [56]

    Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning

    Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. arXiv preprint arXiv:2104.07857, 2021

  57. [57]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020

  58. [58]

    Winogrande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. In AAAI, 2020

  59. [59]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang A. Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M SAIFUL BARI, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang...

  60. [60]

    Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp

    Timo Schick, Sahana Udupa, and Hinrich Schütze. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp. arXiv preprint arXiv:2103.00453, 2021

  61. [61]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam M. Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ArXiv, abs/1701.06538, 2017

  62. [62]

    The woman worked as a babysitter: On biases in language generation

    Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China, Novemb...

  63. [63]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019

  64. [64]

    Robyn Speer. ftfy. Zenodo, 2019. Version 5.5

  65. [65]

    DeepSpeed

    DeepSpeed Team. DeepSpeed. https://github.com/microsoft/DeepSpeed, 2021

  66. [66]

    A simple method for commonsense reasoning

    Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. CoRR, abs/1806.02847, 2018

  67. [67]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017

  68. [68]

    Infobert: Improving robustness of language models from an information theoretic perspective

    Boxin Wang, Shuohang Wang, Yu Cheng, Zhe Gan, R. Jia, Bo Li, and Jingjing Liu. Infobert: Improving robustness of language models from an information theoretic perspective. ArXiv, abs/2010.02329, 2021.

  69. [69]

    Towards zero-label language learning

    Zirui Wang, Adams Wei Yu, Orhan Firat, and Yuan Cao. Towards zero-label language learning. ArXiv, abs/2109.09193, 2021

  70. [70]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. ArXiv, abs/2109.01652, 2021

  71. [71]

    Ethical and social risks of harm from Language Models

    Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. Ethical and ...

  72. [72]

    Challenges in detoxifying language models

    Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445, 2021

  73. [73]

    A broad-coverage challenge corpus for sentence understanding through inference

    Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguis...

  74. [74]

    Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning

    Shaohua Wu, Xudong Zhao, Tong Yu, Rongguo Zhang, Chong Shen, Hongli Liu, Feng Li, Hong Zhu, Jiangang Luo, Liang Xu, and Xuanwei Zhang. Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning. ArXiv, abs/2110.04725, 2021

  75. [75]

    Learning and Evaluating General Linguistic Intelligence

    Dani Yogatama, Cyprien de Masson d’Autume, Jerome T. Connor, Tomás Kociský, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, and Phil Blunsom. Learning and evaluating general linguistic intelligence. CoRR, abs/1901.11373, 2019

  76. [76]

    Hellaswag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In ACL, 2019

  77. [77]

    Defending against neural fake news

    Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. CoRR, abs/1905.12616, 2019

  78. [78]

    subcases

    Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qilong Guo, Yue Yu, Yan Zhang, Jin Wang, Heng Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fan Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhengping Lin, Chao Zhang, Sha...