arxiv: 1910.10683 · v4 · submitted 2019-10-23 · 💻 cs.LG · cs.CL · stat.ML

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

Pith reviewed 2026-05-12 05:33 UTC · model grok-4.3

classification: cs.LG · cs.CL · stat.ML

keywords: transfer learning, text-to-text, transformer, pre-training, natural language processing, summarization, question answering, text classification

The pith

A single text-to-text transformer pre-trained on a large cleaned web corpus reaches state-of-the-art results on many NLP benchmarks when fine-tuned uniformly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how far transfer learning can go in natural language processing by turning every task into the same text-to-text format. It compares pre-training goals, model sizes, data sources, and fine-tuning methods across dozens of tasks. The authors introduce a massive cleaned web dataset and find that scaling up the unified approach produces top performance on summarization, question answering, text classification, and related problems. This shows that one model and training recipe can handle a wide range of language tasks without custom setups for each one.

Core claim

By converting every text-based language problem into a text-to-text format and pre-training a transformer on the Colossal Clean Crawled Corpus with a denoising objective, the resulting model achieves state-of-the-art results on many benchmarks when fine-tuned on downstream tasks covering summarization, question answering, text classification, and more.
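
The denoising objective named in the core claim can be illustrated with a small span-corruption sketch: contiguous spans of the input are each replaced by a single sentinel token, and the target lists each sentinel followed by the text it replaced. This is a minimal sketch under stated assumptions, not the authors' implementation. The function name `span_corrupt`, the sentinel spelling `<X0>`, `<X1>`, ..., and the choice to pass the spans in explicitly (rather than sampling them at random) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def span_corrupt(
    tokens: Sequence[str],
    spans: Sequence[Tuple[int, int]],
) -> Tuple[List[str], List[str]]:
    """Replace each (start, end) span with one sentinel; return (inputs, targets).

    `spans` holds half-open index ranges that are sorted and non-overlapping.
    Sentinel names such as <X0> are placeholders chosen for this sketch.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        if not (cursor <= start < end <= len(tokens)):
            raise ValueError("spans must be sorted, non-empty, and in range")
        sentinel = f"<X{i}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # one sentinel stands in for the span
        targets.append(sentinel)             # the target names the sentinel ...
        targets.extend(tokens[start:end])    # ... then spells out what it hid
        cursor = end
    inputs.extend(tokens[cursor:])           # trailing uncorrupted text
    targets.append(f"<X{len(spans)}>")       # a final sentinel closes the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

In a real pre-training pipeline the spans would be sampled randomly at some corruption rate; fixing them here keeps the example deterministic and easy to check by hand.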

What carries the argument

The text-to-text framework that represents every input and output as plain text strings, allowing one transformer architecture and pre-training procedure to serve all tasks.
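
To make the framing concrete, the sketch below casts three different task types into (source string, target string) pairs, so that one sequence-to-sequence model could train on all of them. It is a hedged illustration only: the class `TextToTextExample`, the helper names, and the exact prefix strings are assumptions chosen for this example, written in the style of task prefixes rather than taken from this page.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TextToTextExample:
    source: str  # the string the model reads
    target: str  # the string the model is trained to generate


def translation_example(english: str, german: str) -> TextToTextExample:
    # The task is identified only by a plain-text prefix on the input.
    return TextToTextExample(f"translate English to German: {english}", german)


def summarization_example(article: str, summary: str) -> TextToTextExample:
    return TextToTextExample(f"summarize: {article}", summary)


def classification_example(
    task_prefix: str,
    text: str,
    label_id: int,
    label_names: Sequence[str],
) -> TextToTextExample:
    # A class index becomes a label word, so the output is text like any other task.
    return TextToTextExample(f"{task_prefix}: {text}", label_names[label_id])


examples = [
    translation_example("That is good.", "Das ist gut."),
    summarization_example("A long article body goes here.", "A short summary."),
    classification_example(
        "cola sentence",
        "The course is jumping well.",
        0,
        ["not acceptable", "acceptable"],
    ),
]
for ex in examples:
    print(repr(ex.source), "->", repr(ex.target))
```

Because every task reduces to the same pair of strings, the training loop, loss, and decoding procedure need no task-specific branches; only the prefix and the label vocabulary change.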

If this is right

One pre-trained model can be adapted to many tasks without designing separate architectures for each.
Larger model scale combined with cleaner and larger unlabeled data improves transfer performance across benchmarks.
Systematic comparison of pre-training objectives and data sources identifies which choices transfer most effectively.
Releasing the pre-trained models, new dataset, and code allows direct reuse and extension by others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The uniform format may reduce the engineering effort needed to apply models to new language problems.
If the text-to-text approach works across many tasks, it could simplify evaluation and comparison of future models.
The success with web-scale cleaned data suggests that data quality and volume matter as much as model architecture for transfer.

Load-bearing premise

Converting every language task into a text-to-text generation problem preserves all necessary information for solving the original task.

What would settle it

A language task where even a very large text-to-text model, after fine-tuning, scores substantially below the best task-specific models on standard metrics.

Original abstract

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

T5 unifies NLP tasks as text-to-text, runs controlled ablations on objectives and scale, and releases everything to back its SOTA claims.

The main thing to know is that this paper unifies a wide range of NLP tasks under a single text-to-text framework and backs it up with systematic experiments on objectives, model sizes, and data sources. What they do well is run controlled comparisons. They test several pre-training objectives on the same setup, compare encoder-decoder to other architectures, and introduce the Colossal Clean Crawled Corpus as a new unlabeled dataset. Combining these with larger models leads to state-of-the-art numbers on tasks like summarization and question answering. The decision to release the pre-trained models, code, and the C4 data makes the claims easier to check and build upon.

The softer parts are around the data preparation. The cleaning heuristics for C4 are described but not deeply ablated, so it's hard to know how sensitive the results are to those choices. Also, while they show the text-to-text approach works, it's not always clear how much information is lost when forcing every task into generation format, though their results suggest it's minimal for the tasks they test. The SOTA claims are benchmark-specific and could shift with new test sets or different fine-tuning protocols.

This is worth the time for anyone in NLP transfer learning or scaling laws. The empirical work is thorough enough and the releases add real value, so it deserves a serious referee.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces T5, a unified text-to-text transformer framework that reformulates all NLP tasks as sequence-to-sequence generation problems. It conducts a systematic empirical study comparing pre-training objectives (e.g., span corruption), model architectures (encoder-decoder vs. decoder-only), unlabeled datasets, and transfer methods across dozens of tasks. By scaling models up to 11B parameters and pre-training on the new Colossal Clean Crawled Corpus (C4), the authors report state-of-the-art results on benchmarks spanning summarization, question answering, text classification, and more, while releasing the models, code, and C4 dataset.

Significance. If the reported results hold under independent verification, the work is significant for establishing a simple, scalable, and unified approach to transfer learning that outperforms prior specialized methods. The thorough controlled ablations isolating the contributions of objective, architecture, and data, combined with the public release of artifacts, provide a strong foundation for future research and reproducibility in NLP.

major comments (2)

[§4.2, Table 7] The headline SOTA claims on GLUE and SuperGLUE rely on single-run fine-tuning results without reported standard deviations or statistical significance tests across multiple random seeds; given known variance in fine-tuning large models, this weakens the strength of the cross-task superiority claims.
[§3.4] The comparison of pre-training objectives uses fixed compute budgets, but the paper does not quantify whether the observed advantage of span corruption over alternatives (e.g., language modeling) persists when allowing each objective its own optimal hyperparameter search or longer training; this is load-bearing for the recommendation of the default objective.

minor comments (3)

[§2] The model size nomenclature (small, base, large, 3B, 11B) is introduced gradually; a single summary table early in §2 or §3 would improve readability.
[Figure 3] Scaling curves: the x-axis for parameter count is logarithmic, but the tick labels and legend could be enlarged for clarity in print.
[Appendix A.3] The account of the C4 cleaning heuristics is detailed, but a short paragraph in the main text summarizing the key filtering steps would help readers without requiring appendix consultation.
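
As a rough picture of what "key filtering steps" for a crawled corpus can look like, here is a minimal sketch of line-level and page-level text filters. It is an assumption-laden illustration, not the paper's pipeline: the function name `clean_page`, the specific rules shown, the crude sentence count, and the default thresholds `min_words_per_line` and `min_sentences` are placeholders chosen for this example, and this page does not confirm any of them.

```python
from typing import Optional

# Characters treated as sentence-ending punctuation in this sketch (an assumption).
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def clean_page(
    text: str,
    min_words_per_line: int = 5,  # placeholder threshold
    min_sentences: int = 3,       # placeholder threshold
) -> Optional[str]:
    """Return the filtered page text, or None if the whole page is rejected."""
    # Page-level rejections: filler text and code-like pages.
    if "lorem ipsum" in text.lower() or "{" in text:
        return None

    kept = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # menus, titles, and fragments rarely end a sentence
        if len(line.split()) < min_words_per_line:
            continue  # too short to be natural prose
        if "javascript" in line.lower():
            continue  # typical "enable Javascript" warning lines
        kept.append(line)

    # Crude sentence count: terminal marks across the surviving lines.
    n_sentences = sum(line.count(mark) for line in kept for mark in ".!?")
    if n_sentences < min_sentences:
        return None
    return "\n".join(kept)


page = (
    "Home | About | Contact\n"
    "This is the first full sentence on the page.\n"
    "Here is a second sentence with enough words!\n"
    "Does the third line also pass the filter?\n"
    "short line.\n"
)
print(clean_page(page))
```

Exposing the thresholds as parameters is deliberate: it is the kind of knob a sensitivity ablation over the cleaning rules would vary.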

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address each major comment below.

Referee: [§4.2, Table 7] The headline SOTA claims on GLUE and SuperGLUE rely on single-run fine-tuning results without reported standard deviations or statistical significance tests across multiple random seeds; given known variance in fine-tuning large models, this weakens the strength of the cross-task superiority claims.

Authors: We acknowledge that reporting standard deviations from multiple random seeds would provide stronger statistical support for the SOTA claims. Due to the prohibitive computational expense of repeated fine-tuning runs for models up to 11B parameters, we reported single-run results for the primary GLUE and SuperGLUE numbers. The observed gains are large in magnitude and consistent across dozens of tasks and model scales, which reduces the likelihood that they arise from random seed variance alone. In the revised manuscript we will add a brief discussion in §4.2 noting the single-run protocol and referencing prior studies on fine-tuning variance. revision: partial
Referee: [§3.4] The comparison of pre-training objectives uses fixed compute budgets, but the paper does not quantify whether the observed advantage of span corruption over alternatives (e.g., language modeling) persists when allowing each objective its own optimal hyperparameter search or longer training; this is load-bearing for the recommendation of the default objective.

Authors: We deliberately held compute budgets fixed across objectives to isolate the effect of the pre-training task itself rather than differences in training duration or hyperparameter optimization. This controlled design is standard for large-scale ablation studies. While we did not conduct per-objective hyperparameter sweeps or extended training, span corruption produced clear and consistent gains under the equal-compute regime. We will revise §3.4 to explicitly state this rationale and note that further per-objective optimization remains an interesting direction for future work. revision: partial
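
The multi-seed reporting discussed in the first exchange amounts to summarizing one metric over several fine-tuning runs as a mean and a sample standard deviation. The sketch below shows that arithmetic only. The helper name `summarize_runs` is hypothetical, and the scores are toy values picked to make the calculation easy to follow; they are not results from the paper or from any model.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of one metric across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


# Toy values only: one entry per hypothetical random seed.
toy_scores = [1.0, 2.0, 3.0]
avg, spread = summarize_runs(toy_scores)
print(f"{avg:.2f} +/- {spread:.2f} over {len(toy_scores)} seeds")
```

A gap between two systems is more convincing when it is large relative to this spread, which is the comparison a single-run protocol cannot provide.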

Circularity Check

0 steps flagged

No significant circularity; empirical results are self-contained

The paper conducts a large-scale empirical exploration of transfer learning by reformulating NLP tasks as text-to-text problems, systematically ablating pre-training objectives, architectures, data sources, and scaling behaviors across dozens of benchmarks. All central claims (including SOTA results) derive from direct experimental measurements on the released C4 corpus and models rather than from any closed-form derivations, fitted parameters renamed as predictions, or self-citation chains. No equations or uniqueness theorems are invoked that reduce the reported outcomes to inputs by construction; the work is therefore independent and verifiable through the provided artifacts.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical comparisons rather than closed-form derivations. Free parameters include model scale choices, pre-training objective variants, and the specific data-cleaning rules used to construct the Colossal Clean Crawled Corpus. The work assumes standard transformer inductive biases and the effectiveness of transfer from unlabeled pre-training.

free parameters (3)

model scale (small to 11B parameters)
Different parameter counts are trained and compared; performance depends on these choices.
pre-training objective variants
Multiple objectives (e.g., span corruption) are selected and evaluated; results are sensitive to which are used.
C4 data-cleaning heuristics
Rules for filtering the crawled corpus are introduced and affect the pre-training data distribution.

axioms (1)

domain assumption: Pre-training on large unlabeled text followed by fine-tuning improves performance on downstream language tasks
Invoked as the foundation for all transfer experiments in the abstract.

pith-pipeline@v0.9.0 · 5485 in / 1497 out tokens · 79829 ms · 2026-05-12T05:33:22.657738+00:00 · methodology

Forward citations

Cited by 46 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
cs.CL 2022-01 accept novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 8.0

Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
Show Your Work: Scratchpads for Intermediate Computation with Language Models
cs.LG 2021-11 unverdicted novelty 8.0

Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
cs.CL 2020-12 conditional novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
Measuring Massive Multitask Language Understanding
cs.CY 2020-09 accept novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
REALM: Retrieval-Augmented Language Model Pre-Training
cs.CL 2020-02 accept novelty 8.0

REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.
The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently
cs.LG 2026-05 unverdicted novelty 7.0

Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
SWAN: Semantic Watermarking with Abstract Meaning Representation
cs.CL 2026-05 unverdicted novelty 7.0

SWAN uses AMR to embed semantic watermarks that persist through paraphrases, matching SOTA detection on original text and improving AUC by 13.9 points on paraphrased RealNews data.
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
cs.MM 2026-04 unverdicted novelty 7.0

AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
q-bio.QM 2026-04 unverdicted novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
Unlocking Prompt Infilling Capability for Diffusion Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
cs.DC 2026-04 unverdicted novelty 7.0

Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
cs.LG 2022-08 conditional novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
Flamingo: a Visual Language Model for Few-Shot Learning
cs.CV 2022-04 unverdicted novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
cs.LG 2021-01 accept novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
GraphCodeBERT: Pre-training Code Representations with Data Flow
cs.SE 2020-09 accept novelty 7.0

GraphCodeBERT uses data flow graphs in pre-training to capture semantic code structure and reaches state-of-the-art results on code search, clone detection, translation, and refinement.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
cs.CL 2020-05 accept novelty 7.0

RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
cs.CL 2019-09 accept novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
cs.CL 2019-09 unverdicted novelty 7.0

Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
The two clocks and the innovation window: When and how generative models learn rules
cs.LG 2026-05 unverdicted novelty 6.0

Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 6.0

TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
cs.CL 2026-04 unverdicted novelty 6.0

A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
cs.LG 2026-04 unverdicted novelty 6.0

RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
cs.RO 2025-12 unverdicted novelty 6.0

mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
cs.AI 2024-08 unverdicted novelty 6.0

A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
Vision Transformers Need Registers
cs.CV 2023-09 unverdicted novelty 6.0

Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
cs.LG 2023-09 accept novelty 6.0

DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
cs.CV 2022-06 unverdicted novelty 6.0

Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
CoCa: Contrastive Captioners are Image-Text Foundation Models
cs.CV 2022-05 accept novelty 6.0

CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
ST-MoE: Designing Stable and Transferable Sparse Expert Models
cs.CL 2022-02 unverdicted novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
Unsupervised Dense Information Retrieval with Contrastive Learning
cs.IR 2021-12 unverdicted novelty 6.0

Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
Ethical and social risks of harm from Language Models
cs.CL 2021-12 accept novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Linformer: Self-Attention with Linear Complexity
cs.LG 2020-06 conditional novelty 6.0

Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
cs.CL 2020-02 unverdicted novelty 6.0

CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.
How Much Knowledge Can You Pack Into the Parameters of a Language Model?
cs.CL 2020-02 accept novelty 6.0

Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.
HuggingFace's Transformers: State-of-the-art Natural Language Processing
cs.CL 2019-10 accept novelty 6.0

Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 5.0

Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
Uncertainty-Aware Transformers: Conformal Prediction for Language Models
cs.LG 2026-04 unverdicted novelty 5.0

CONFIDE applies conformal prediction to transformer embeddings for valid prediction sets, improving accuracy up to 4.09% and efficiency over baselines on models like BERT-tiny.
Voice Biomarkers for Depression and Anxiety
cs.LG 2026-05 unverdicted novelty 4.0

Deep learning models extract content-agnostic voice biomarkers for depression and anxiety from a ~65k-utterance proprietary dataset, achieving 71% sensitivity and specificity when combined with lexical features.
Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
cs.CL 2026-04 unverdicted novelty 4.0

Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance ...
Gemma: Open Models Based on Gemini Research and Technology
cs.CL 2024-03 accept novelty 4.0

Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
Gemma 2: Improving Open Language Models at a Practical Size
cs.CL 2024-07 conditional novelty 3.0

Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 45 Pith papers · 19 internal anchors

[1]

Memory-efficient adaptive optimization for large-scale learning.arXiv preprint arXiv:1901.11150,

Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. Memory-efficient adaptive optimization for large-scale learning.arXiv preprint arXiv:1901.11150,

work page arXiv 1901
[2]

Massively multilingual neural machine translation in the wild: Findings and challenges

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. Massively multi- lingual neural machine translation in the wild: Findings and challenges.arXiv preprint arXiv:1907.05019,

work page arXiv 1907
[3]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization.arXiv preprint arXiv:1607.06450,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Cloze-driven Pretraining of Self-attention Networks

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. Cloze- driven pretraining of self-attention networks.arXiv preprint arXiv:1903.07785,

work page Pith review arXiv 1903
[5]

Simple, scalable adaptation for neural machine translation.arXiv preprint arXiv:1909.08478,

Ankur Bapna, Naveen Arivazhagan, and Orhan Firat. Simple, scalable adaptation for neural machine translation.arXiv preprint arXiv:1909.08478,

work page arXiv 1909
[6]

SciBERT: A pretrained language model for scientific text

Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),

work page 2019
[7]

Findings of the 2014 workshop on statistical machine translation

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Jo- hannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. Findings of the 2014 workshop on statistical machine translation. InProceedings of the Ninth Workshop on Statistical Machine Translation,

work page 2014
[8]

Findings of the 2015 workshop on statistical machine translation

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, et al. Findings of the 2015 workshop on statistical machine translation. InProceedings of the Tenth Workshop on Statistical Machine Translation,

work page 2015
[9]

Findings of the 2016 conference on machine translation

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. Findings of the 2016 conference on machine translation. InProceedings of the First Conference on Machine Translation,

work page 2016
[10]

Bowman, Luke Vilnis, Oriol Vinyals, Andrew M

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space.arXiv preprint arXiv:1511.06349,

work page arXiv
[11]

SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055,

work page Pith review arXiv 2017
[12]

Long short-term memory-networks for machine reading.arXiv preprint arXiv:1601.06733,

Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading.arXiv preprint arXiv:1601.06733,

work page arXiv
[13]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,

work page internal anchor Pith review arXiv 1905
[14]

Electra: Pre-training text encoders as discriminators rather than generators.arXiv preprint arXiv:2003.10555, 2020

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555,

work page arXiv 2003
[15]

SentEval: An evaluation toolkit for universal sentence representations

Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449,

work page arXiv
[16]

Super- vised learning of universal sentence representations from natural language inference data

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Super- vised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364,

work page arXiv
[17]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre- training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805,

[18]

Unified language model pre-training for natural language understanding and generation

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197,

[19]

Understanding back-translation at scale

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381,

[20]

Learning word vectors for 157 languages

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages. arXiv preprint arXiv:1802.06893,

[21]

Generating sequences with recurrent neural networks

Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850,

[22]

Rethinking ImageNet pre-training

Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. arXiv preprint arXiv:1811.08883,

[23]

A hybrid neural network model for commonsense reasoning

Pengcheng He, Xiaodong Liu, Weizhu Chen, and Jianfeng Gao. A hybrid neural network model for commonsense reasoning. arXiv preprint arXiv:1907.11983,

[24]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409,

[25]

Learning distributed representations of sentences from unlabelled data

Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483,

[26]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

[27]

Parameter-Efficient Transfer Learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. arXiv preprint arXiv:1902.00751,

[28]

Universal Language Model Fine-tuning for Text Classification

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146,

[29]

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music transformer: Generating music with long-term structure. In Seventh International Conference on Learning Representations, 2018a. Yanping ...

[30]

TinyBERT: Distilling BERT for natural language understanding

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351,

[31]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551,

[32]

SpanBERT: Improving pre-training by representing and predicting spans

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529,

[33]

Exploring the Limits of Language Modeling

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410,

[34]

CTRL: A conditional transformer language model for controllable generation

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019a. Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. Unifying question answering and text classification via span extraction. arXiv preprint ...

[35]

A surprisingly robust trick for Winograd schema challenge

Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. A surprisingly robust trick for Winograd schema challenge. arXiv preprint arXiv:1905.06290,

[36]

Federated optimization: Distributed optimization beyond the datacenter

Jakub Konečný, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575,

[37]

Federated Learning: Strategies for Improving Communication Efficiency

Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492,

[38]

Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? arXiv preprint arXiv:1805.08974,

[39]

One weird trick for parallelizing convolutional neural networks

Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997,

[40]

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959,

[41]

SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226,

[42]

Cross-lingual Language Model Pretraining

Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291,

[43]

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942,

[44]

Generating Wikipedia by summarizing long sequences

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198,

[45]

SummAE: Zero-shot abstractive text summarization using length-agnostic auto-encoders

Peter J. Liu, Yu-An Chung, and Jie Ren. SummAE: Zero-shot abstractive text summarization using length-agnostic auto-encoders. arXiv preprint arXiv:1910.00998, 2019a. Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. Representation learning using multi-task deep neural networks for se...

[46]

Multi-Task Deep Neural Networks for Natural Language Understanding

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019b. Yang Liu. Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318,

[47]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019c. Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893,

[48]

The Natural Language Decathlon: Multitask Learning as Question Answering

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730,

[49]

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing system...

[50]

A deep reinforced model for abstractive summarization

Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304,

[51]

GloVe: Global vectors for word representation

Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP),

[52]

Matthew Peters, Sebastian Ruder, and Noah A. Smith. To tune or not to tune? adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987,

[53]

Deep contextualized word representations

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365,

[54]

Jason Phang, Thibault Févry, and Samuel R. Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088,

[55]

WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations

Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: 10,000 example pairs for evaluating context-sensitive representations. arXiv preprint arXiv:1808.09121,

[56]

A call for clarity in reporting BLEU scores

Matt Post. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771,

[57]

Resolving complex cases of definite pronouns: the Winograd schema challenge

Altaf Rahman and Vincent Ng. Resolving complex cases of definite pronouns: the Winograd schema challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics,

[58]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250,

[59]

Unsupervised pretraining for sequence to sequence learning

Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683,

[60]

An Overview of Multi-Task Learning in Deep Neural Networks

Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098,

[61]

Transfer learning in natural language processing

Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18,

[62]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108,

[63]

Get To The Point: Summarization with Pointer-Generator Networks

Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368,

[64]

Neural Machine Translation of Rare Words with Subword Units

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909,

[65]

Christopher J Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600,

[66]

Self-attention with relative position representations

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155,

[67]

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235,

[68]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,

[69]

Recursive deep models for semantic compositionality over a sentiment treebank

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing,

[70]

MASS: Masked sequence to sequence pre-training for language generation

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450,

[71]

Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. arXiv preprint arXiv:1804.00079,

[72]

A simple method for commonsense reasoning

Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847,

[73]

NewsQA: A machine comprehension dataset

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830,

[74]

The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives

Elena Voita, Rico Sennrich, and Ivan Titov. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. arXiv preprint arXiv:1909.01380,

[75]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461,

[76]

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, et al. Can you tell me how to get past Sesame Street? Sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019a. Alex Wang, Y...

[77]

Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426,

[78]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144,

[79]

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237,

[80]

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541,

