Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Pith reviewed 2026-05-24 12:09 UTC · model grok-4.3
The pith
A 530 billion parameter transformer model trained via DeepSpeed and Megatron sets new state-of-the-art results on zero-, one-, and few-shot NLP benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present details on the training of the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. Using DeepSpeed and Megatron, we employ a 3D parallelism methodology to enable training at this scale. The design of the training corpus and the data curation techniques, which we believe are key ingredients in the model's success, allow MT-NLG to achieve superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and to establish new state-of-the-art results.
What carries the argument
3D parallelism (data, model, and pipeline) implemented in DeepSpeed and Megatron that distributes the 530 billion parameter transformer across hardware while maintaining training stability.
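To make the three axes concrete, here is a minimal sketch of how a flat GPU rank could be decomposed into data-, pipeline-, and tensor-parallel coordinates. The degrees used (tensor=8, pipeline=4, data=2), the choice of tensor parallelism as the fastest-varying axis, and the names `ParallelLayout` and `coords` are illustrative assumptions; they are not the MT-NLG configuration and not the DeepSpeed or Megatron API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical 3D layout; the degrees are illustrative, not MT-NLG's."""
    tensor: int    # GPUs that jointly hold slices of each layer's weights
    pipeline: int  # stages, each owning a consecutive group of layers
    data: int      # full model replicas that process different batches

    @property
    def world_size(self) -> int:
        return self.tensor * self.pipeline * self.data

    def coords(self, rank: int) -> Tuple[int, int, int]:
        """Map a flat rank to (data, pipeline, tensor) indices.

        Assumes tensor parallelism varies fastest, so a tensor-parallel
        group occupies adjacent ranks (for example, GPUs within one node).
        """
        if not 0 <= rank < self.world_size:
            raise ValueError(f"rank {rank} outside world of {self.world_size}")
        tensor_idx = rank % self.tensor
        pipeline_idx = (rank // self.tensor) % self.pipeline
        data_idx = rank // (self.tensor * self.pipeline)
        return data_idx, pipeline_idx, tensor_idx


layout = ParallelLayout(tensor=8, pipeline=4, data=2)
print(layout.world_size)   # 64
print(layout.coords(13))   # (0, 1, 5)
print(layout.coords(63))   # (1, 3, 7)
```

Each rank receives a unique coordinate triple, so the product of the three degrees must equal the number of GPUs in the job.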
If this is right
- The infrastructure details enable training of monolithic models at hundreds of billions of parameters.
- Data curation techniques directly contribute to the observed generalization in zero- and few-shot regimes.
- MT-NLG exhibits new properties in natural language generation that prior smaller models did not display.
- The same training stack can be reused to push model size further while retaining benchmark gains.
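A back-of-the-envelope calculation suggests why a model of this size has to be split across many devices. The sketch below assumes mixed-precision Adam training at 16 bytes of model state per parameter and an accelerator with 80 GB of memory. Both figures are assumptions for illustration rather than numbers taken from the paper, and activation memory is ignored entirely.

```python
PARAMS = 530_000_000_000  # parameter count stated for MT-NLG

# Assumed bytes of model state per parameter under mixed-precision Adam:
# fp16 weights (2) + fp16 gradients (2) + fp32 master weights (4)
# + fp32 momentum (4) + fp32 variance (4).
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

# Assumed device memory in bytes; activations and buffers are ignored.
DEVICE_BYTES = 80_000_000_000


def min_devices_for_state(params: int, bytes_per_param: int, device_bytes: int) -> int:
    """Lower bound on devices needed to hold weights, gradients, and optimizer state."""
    total = params * bytes_per_param
    return -(-total // device_bytes)  # ceiling division on integers


print(f"model state: {PARAMS * BYTES_PER_PARAM / 1e12:.2f} TB")           # model state: 8.48 TB
print(min_devices_for_state(PARAMS, BYTES_PER_PARAM, DEVICE_BYTES))       # 106
```

Under these assumptions the model state alone needs at least 106 devices before any activations are stored, so no single device or single node could hold the training job.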
Where Pith is reading between the lines
- Similar parallelism and curation patterns could be applied to multimodal models that combine text with images or code.
- The reported scaling behavior suggests that further increases in parameter count may continue to improve few-shot performance without architectural changes.
- Open release of the exact corpus composition would allow independent verification of the data-curation hypothesis.
Load-bearing premise
The design of the training corpus and the data curation techniques are a key ingredient to the success of the model.
What would settle it
A controlled replication that trains an identical 530B model on the same hardware and code but with a standard public corpus lacking the described curation steps, then measures whether zero- and few-shot benchmark scores fall below the reported levels.
Original abstract
Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe is a key ingredient to the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG. We demonstrate that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results. We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes the joint Microsoft-NVIDIA effort to train Megatron-Turing NLG 530B (MT-NLG), a 530-billion-parameter monolithic transformer language model. It details the infrastructure and 3D parallelism techniques implemented with DeepSpeed and Megatron, the training process, the design and curation of the training corpus, and reports that MT-NLG achieves superior zero-, one-, and few-shot accuracies on multiple NLP benchmarks, establishing new state-of-the-art results.
Significance. If the performance claims hold under comparable evaluation conditions, the work provides a valuable engineering record of scaling transformer training to 530B parameters. The explicit treatment of 3D parallelism, data curation practices, and infrastructure choices offers concrete guidance for future large-scale training efforts. The paper also surfaces observations about emergent properties of the model.
major comments (2)
- [Evaluation] Evaluation section: the manuscript asserts new state-of-the-art zero-/one-/few-shot results on several NLP benchmarks, yet supplies no benchmark numbers, baselines, or statistical details in the abstract and does not reference a fixed evaluation harness. This absence prevents verification that reported margins are protocol-independent rather than arising from prompt or decoding differences.
- [Evaluation] Evaluation section: the paper does not provide the exact prompt templates, number of shots, or decoding settings used for each benchmark where new SOTA is claimed. Without these, it is impossible to confirm that the superiority is attributable to model scale or data rather than evaluation-protocol variations relative to prior work.
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one or two concrete benchmark numbers and the corresponding prior SOTA values.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive comments on the evaluation section. We agree that greater transparency in benchmark reporting, baselines, and protocol details will improve verifiability. We will revise the manuscript to address these points while preserving the paper's focus on training infrastructure and data curation.
-
Referee: [Evaluation] Evaluation section: the manuscript asserts new state-of-the-art zero-/one-/few-shot results on several NLP benchmarks, yet supplies no benchmark numbers, baselines, or statistical details in the abstract and does not reference a fixed evaluation harness. This absence prevents verification that reported margins are protocol-independent rather than arising from prompt or decoding differences.
Authors: We acknowledge that the abstract contains only a high-level claim without numerical results or harness details, which is typical for length-constrained abstracts but can reduce immediate verifiability. The main Evaluation section does include comparative tables against prior models; however, to strengthen the paper we will (1) add a concise summary of key benchmark scores and baselines to the abstract where space permits, (2) explicitly name the evaluation harness and any custom adaptations in the Evaluation section, and (3) include error bars or statistical notes where multiple runs were performed. These changes will be made in the revised manuscript. revision: yes
-
Referee: [Evaluation] Evaluation section: the paper does not provide the exact prompt templates, number of shots, or decoding settings used for each benchmark where new SOTA is claimed. Without these, it is impossible to confirm that the superiority is attributable to model scale or data rather than evaluation-protocol variations relative to prior work.
Authors: We agree that reproducibility of the reported SOTA claims requires the precise prompts, shot counts, and decoding parameters. The current manuscript references standard few-shot setups for the cited benchmarks but does not reproduce the templates. In the revision we will add a dedicated appendix (or subsection) that lists, for every benchmark where a new SOTA is claimed: the exact prompt template, number of shots, decoding strategy (e.g., greedy, nucleus sampling parameters), and any post-processing steps. This will allow direct comparison with prior work and confirm that gains are not protocol artifacts. revision: yes
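The protocol details requested in this exchange could be captured in a small per-benchmark record. The following is a minimal sketch only: the class `EvalProtocol`, its field names, the benchmark name `example-qa`, and the prompt template are hypothetical placeholders and are not taken from the paper or from any evaluation harness.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical record of the settings needed to rerun one benchmark."""
    benchmark: str
    num_shots: int
    example_template: str  # formats one (question, answer) pair
    decoding: Dict[str, object] = field(default_factory=dict)

    def build_prompt(self, shots: List[Tuple[str, str]], query: str) -> str:
        """Join the in-context examples and the unanswered query into one prompt."""
        if len(shots) != self.num_shots:
            raise ValueError(f"expected {self.num_shots} shots, got {len(shots)}")
        blocks = [self.example_template.format(question=q, answer=a) for q, a in shots]
        # The final block leaves the answer empty for the model to complete.
        blocks.append(self.example_template.format(question=query, answer="").rstrip())
        return "\n\n".join(blocks)


protocol = EvalProtocol(
    benchmark="example-qa",  # placeholder name, not a real benchmark
    num_shots=1,
    example_template="Question: {question}\nAnswer: {answer}",
    decoding={"strategy": "greedy", "max_new_tokens": 16},
)
prompt = protocol.build_prompt([("What is 2 + 2?", "4")], "What is 3 + 3?")
print(prompt)
```

Publishing one such record per benchmark would let a reader check whether a reported margin survives a change of template, shot count, or decoding strategy.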
Circularity Check
Empirical training report with no derivation chain
The manuscript is an engineering report detailing hardware/software infrastructure (DeepSpeed + Megatron 3D parallelism), training corpus construction, data curation, and measured benchmark accuracies for the 530B model. No equations, fitted parameters, or predictions are presented that reduce by construction to the paper's own inputs. Evaluation results are reported as direct empirical outcomes rather than derived quantities. No self-citation load-bearing steps, ansatz smuggling, or uniqueness theorems appear in the derivation chain. The paper is self-contained as a factual account of a training run.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Transformer-based language models scale effectively with parameter count and data quality
- domain assumption: 3D parallelism from DeepSpeed and Megatron can be applied without fundamental bottlenecks at this scale
Forward citations
Cited by 42 Pith papers
-
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
-
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
-
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
-
Large Language Models are Zero-Shot Reasoners
Adding the fixed prompt 'Let's think step by step' enables large language models to achieve substantial zero-shot gains on arithmetic, symbolic, and logical reasoning benchmarks without any task-specific examples.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
-
veScale-FSDP: Flexible and High-Performance FSDP at Scale
veScale-FSDP uses RaggedShard and structure-aware planning to support block-wise quantization and non-element-wise optimizers while delivering 5-66% higher throughput and 16-30% lower memory than prior FSDP systems at...
-
Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection
Chameleon provides adaptive fault tolerance for distributed training by real-time selection of optimal recovery policies via a unified performance model, demonstrated with low overhead on a 32-card cluster.
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
-
MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax-01 models match GPT-4o and Claude-3.5-Sonnet performance while providing 20-32 times longer context windows through lightning attention and MoE scaling.
-
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
-
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
-
Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
-
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.
-
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
Language Models can Solve Computer Tasks
Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.
-
FP8 Formats for Deep Learning
FP8 formats E4M3 and E5M2 match 16-bit training accuracy on CNNs, RNNs, and Transformers up to 175B parameters without hyperparameter changes.
-
Atlas: Few-shot Learning with Retrieval Augmented Language Models
Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
-
Efficient Training of Language Models to Fill in the Middle
Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.
-
MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning
MRKL is a modular neuro-symbolic architecture that integrates LLMs with external knowledge and discrete reasoning to overcome limitations of pure neural language models.
-
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference
Charon is a unified modular simulator that predicts LLM training and inference performance with under 5.35% error and identifies throughput improvements over baselines in a real deployment case.
-
Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference
Charon is a unified fine-grained simulator that predicts LLM performance with under 5.35% error overall and under 3.74% for large-scale training, and it found a better inference configuration than an engineering baseline.
-
Transforming the Use of Earth Observation Data: Exascale Training of a Generative Compression Model with Historical Priors for up to 10,000x Data Reduction
A generative compression model using historical priors for Earth observation data achieves up to 10,000x reduction after exascale training on an Armv9 supercomputer.
-
TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.
-
SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention
SparseBalance dynamically adjusts sparsity and batches workloads to load-balance sparse attention training, delivering up to 1.33x speedup and 0.46% better long-context performance on LongBench.
-
SEDD: Scalable and Efficient Dataset Deduplication with GPUs
SEDD delivers a distributed GPU deduplication system that reports up to 158x speedup over CPU baselines and 7.8x over NeMo Curator on 30M documents while preserving MinHash fidelity above 0.95 Jaccard.
-
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.
-
Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips
On Grace Hopper superchips, energy efficiency during multimodal training is governed by data movement and overlap rather than compute utilization, and runtime-optimal configurations are not always energy-optimal.
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
-
Phoenix-VL 1.5 Medium Technical Report
Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying comp...
-
A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models
A combined parallelism recipe on SuperMUC-NG Phase 2 delivers 10% of theoretical peak throughput for 175B models plus 93% weak and 82% strong scaling efficiency on 128 nodes using unmodified public software.
-
Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector
Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
-
A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
Reference graph
Works this paper leans on
-
[1]
https://www.nvidia.com/en-us/data-center/a100/
NVIDIA A100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/a100/
-
[2]
https://www.top500.org/system/179842/
NVIDIA Selene Supercomputer. https://www.top500.org/system/179842/
-
[3]
https://www.nvidia.com/en-us/data-center/nvlink/
NVLink and NVSwitch. https://www.nvidia.com/en-us/data-center/nvlink/
-
[4]
Turing-NLG: A 17-billion-parameter language model by Mi- crosoft. https://www.microsoft.com/en-us/research/blog/ turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
- [5]
-
[6]
Piqa: Reasoning about physical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In AAAI, 2020
work page 2020
-
[7]
Zou, Venkatesh Saligrama, and Adam Tauman Kalai
Tolga Bolukbasi, Kai-Wei Chang, James Y . Zou, Venkatesh Saligrama, and Adam Tauman Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In NIPS, 2016
work page 2016
-
[8]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportuni- ties and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[9]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert- V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...
work page 2020
-
[10]
BoolQ: Exploring the surprising difficulty of natural yes/no questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ...
work page 2019
- [11]
-
[12]
Bert: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019
work page 2019
-
[13]
Penelope Eckert and Sally McConnell-Ginet. Language and Gender . Cambridge University Press, 2003
work page 2003
-
[14]
Improving gender fairness of pre-trained language models without catastrophic forgetting
Zahra Fatemi, Chen Xing, Wenhao Liu, and Caiming Xiong. Improving gender fairness of pre-trained language models without catastrophic forgetting. arXiv preprint arXiv:2110.05367, 2021
-
[15]
William Fedus, Barret Zoph, and Noam M. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.ArXiv, abs/2101.03961, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[16]
Lyn Frazier and Janet D. Fodor. The sausage machine: A new two-stage parsing model. Cognition, 6(4):291–325, 1978. Place: Netherlands Publisher: Elsevier Science
work page 1978
-
[17]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[18]
A framework for few-shot language model evaluation, September 2021
Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Gold- ing, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021
work page 2021
-
[19]
Realtoxici- typrompts: Evaluating neural toxic degeneration in language models
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxici- typrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, 2020
work page 2020
-
[20]
Suchin Gururangan, Ana Marasovi ´c, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceed- ings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 8342–8360, Online, July 2020. Association for Computational Linguistics
work page 2020
-
[21]
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In Marilyn A. Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologie...
work page 2018
-
[22]
Pretrained transformers improve out-of-distribution robustness
Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. Pretrained transformers improve out-of-distribution robustness. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020,...
work page 2020
-
[23]
Gpipe: Efficient training of giant neural networks using pipeline parallelism
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32:103–112, 2019
work page 2019
-
[24]
Improving machine reading comprehension with single-choice decision and transfer learning
Yufan Jiang, Shuangzhi Wu, Jing Gong, Yahui Cheng, Peng Meng, Weiliang Lin, Zhibo Chen, and Mu Li. Improving machine reading comprehension with single-choice decision and transfer learning. ArXiv, abs/2011.03292, 2020
-
[25]
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, 2017
work page 2017
-
[26]
Exploring the Limits of Language Modeling
Rafal J ´ozefowicz, Oriol Vinyals, Mike Schuster, Noam M. Shazeer, and Yonghui Wu. Exploring the limits of language modeling. ArXiv, abs/1602.02410, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[27]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[28]
Gedi: Generative discriminator guided sequence generation, 2021
Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, richard socher, and Nazneen Rajani. Gedi: Generative discriminator guided sequence generation, 2021
work page 2021
-
[29]
Dense-captioning events in videos
Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. 2017 IEEE International Conference on Computer Vision (ICCV) , pages 706–715, 2017
work page 2017
-
[30]
Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Al- berti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Nat- ural Questions: A Benchmark for Question Answering Research. Tran...
work page 2019
-
[31]
RACE: Large-scale ReAd- ing comprehension dataset from examinations
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAd- ing comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics
work page 2017
-
[32]
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yan-Ping Huang, Maxim Krikun, Noam M. Shazeer, and Z. Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. ArXiv, abs/2006.16668, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[33]
The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[34]
Jurassic-1: Technical details and evaluation
Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation. 30
-
[35]
Junyang Lin, Rui Men, An Yang, Chan Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, J. Zhang, Jianwei Zhang, Xu Zou, Zhikang Li, Xiao Qing Deng, Jie Liu, Jinbao Xue, Huiling Zhou, Jianxin Ma, Jin Yu, Yonghong Li, Wei Lin, Jingren Zhou, J ie Tang, and Hongxia Yang. M6: A chinese multimodal pretrainer. ArXiv, abs/2103.00823, 2021
-
[36]
M6-10t: A sharing-delinking paradigm for efficient multi-trillion parameter pretraining
Junyang Lin, An Yang, Jinze Bai, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Yong Li, Wei Lin, Jingren Zhou, and Hongxia Yang. M6-10t: A sharing-delinking paradigm for efficient multi-trillion parameter pretraining. 2021
work page 2021
-
[37]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[38]
Unicorn on rainbow: A universal commonsense reasoning model on a new multitask benchmark
Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Unicorn on rainbow: A universal commonsense reasoning model on a new multitask benchmark. In AAAI, 2021
work page 2021
-
[39]
Thomas Manzini, Lim Yao Chong, Alan W Black, and Yulia Tsvetkov. Black is to criminal as cau- casian is to police: Detecting and removing multiclass bias in word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pa...
work page 2019
-
[40]
Right for the wrong reasons: Diagnosing syntactic heuris- tics in natural language inference
Tom McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuris- tics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy, July 2019. Association for Computa- tional Linguistics
work page 2019
-
[41]
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[42]
Pipedream: generalized pipeline parallelism for dnn training
Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019
work page 2019
-
[43]
Efficient large-scale language model training on gpu clusters using megatron-lm
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. ArXiv, abs/2104.04473, 2021
-
[44]
Mitigating harm in language models with conditional-likelihood filtration
Helen Ngo, Cooper Raterink, João GM Araújo, Ivan Zhang, Carol Chen, Adrien Morisot, and Nicholas Frosst. Mitigating harm in language models with conditional-likelihood filtration. arXiv preprint arXiv:2108.07790, 2021
-
[45]
Toan Q. Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. CoRR, abs/1910.05895, 2019
-
[46]
Adversarial nli: A new benchmark for natural language understanding
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding. ArXiv, abs/1910.14599, 2020
-
[47]
Adversarial NLI: A new benchmark for natural language understanding
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, ...
work page 2020
-
[48]
Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, and Caroline Iliadi, editors, 7th Workshop on the Challenges in the Management of...
work page 2019
-
[49]
The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...
work page 2016
-
[50]
Wic: the word-in-context dataset for evaluating context-sensitive meaning representations
Mohammad Taher Pilehvar and José Camacho-Collados. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. In NAACL, 2019
work page 2019
- [51]
-
[52]
Language models are unsupervised multitask learners
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019
work page 2019
-
[53]
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, ...
work page 2021
-
[54]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel, Noam Shazeer, et al. Exploring the Limits of Transfer Learning with a Unified Text-to- Text Transformer. ArXiv, abs/1910.10683, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[55]
Zero: Memory optimizations toward training trillion parameter models
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020
work page 2020
-
[56]
Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning
Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. arXiv preprint arXiv:2104.07857, 2021
-
[57]
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020
work page 2020
-
[58]
Winogrande: An adversarial winograd schema challenge at scale
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. In AAAI, 2020
work page 2020
-
[59]
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang A. Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[60]
Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp
Timo Schick, Sahana Udupa, and Hinrich Schütze. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp. arXiv preprint arXiv:2103.00453, 2021
-
[61]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam M. Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ArXiv, abs/1701.06538, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[62]
The woman worked as a babysitter: On biases in language generation
Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China, Novemb...
work page 2019
-
[63]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[64]
Robyn Speer. ftfy. Zenodo, 2019. Version 5.5
work page 2019
- [65]
-
[66]
Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. CoRR, abs/1806.02847, 2018
-
[67]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[68]
Boxin Wang, Shuohang Wang, Yu Cheng, Zhe Gan, R. Jia, Bo Li, and Jingjing Liu. Infobert: Improving robustness of language models from an information theoretic perspective. ArXiv, abs/2010.02329, 2021
-
[69]
Towards zero-label language learning
Zirui Wang, Adams Wei Yu, Orhan Firat, and Yuan Cao. Towards zero-label language learning. ArXiv, abs/2109.09193, 2021
-
[70]
Finetuned Language Models Are Zero-Shot Learners
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. ArXiv, abs/2109.01652, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[71]
Ethical and social risks of harm from Language Models
Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. Ethical and ...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[72]
Challenges in detoxifying language models
Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445, 2021
-
[73]
A broad-coverage challenge corpus for sentence understanding through inference
Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguis...
work page 2018
-
[74]
Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning
Shaohua Wu, Xudong Zhao, Tong Yu, Rongguo Zhang, Chong Shen, Hongli Liu, Feng Li, Hong Zhu, Jiangang Luo, Liang Xu, and Xuanwei Zhang. Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning. ArXiv, abs/2110.04725, 2021
-
[75]
Learning and Evaluating General Linguistic Intelligence
Dani Yogatama, Cyprien de Masson d’Autume, Jerome T. Connor, Tomás Kociský, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, and Phil Blunsom. Learning and evaluating general linguistic intelligence. CoRR, abs/1901.11373, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1901
-
[76]
Hellaswag: Can a machine really finish your sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In ACL, 2019
work page 2019
-
[77]
Defending against neural fake news
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. CoRR, abs/1905.12616, 2019
-
[78]
Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qilong Guo, Yue Yu, Yan Zhang, Jin Wang, Heng Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fan Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhengping Lin, Chao Zhang, Sha...