To Tune or Not To Tune? How About the Best of Both Worlds?

Chunye Wang; Haibo Su; Jupeng Ding; Kailin Ji; Ran Wang

arxiv: 1907.05338 · v1 · pith:UKRDTWNAnew · submitted 2019-07-09 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-25 00:43 UTC · model grok-4.3

keywords BERT adaptation · fine-tuning · pre-trained language models · semantic similarity · sequence labeling · text classification · model tuning

The pith

A two-stage adaptation method first trains the task head with BERT frozen, then fine-tunes the full model jointly, and reports higher accuracy on three downstream tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to adapt pre-trained language models like BERT for specific tasks. It introduces a procedure that begins by training a task model while keeping BERT parameters frozen, then continues by fine-tuning all parameters together. Experiments on three tasks report accuracy gains of 4.7 percent on semantic similarity, 0.99 percent on sequence labeling, and 0.72 percent on text classification. The approach is presented as a way to obtain the benefits of both stable early adaptation and later full optimization.

Core claim

Training first with frozen BERT parameters and then fine-tuning the entire model together produces higher accuracy than standard single-stage adaptation on semantic similarity, sequence labeling, and text classification tasks.

What carries the argument

The two-stage adaptation procedure that starts with frozen pre-trained parameters before unfreezing them for joint optimization.
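
The schedule can be illustrated with a toy stand-in. The following is a minimal sketch under stated assumptions, not the authors' implementation: a small fixed matrix `W` plays the role of the pre-trained encoder (BERT in the paper), a vector `v` plays the role of the task head, and the synthetic data, learning rates, and epoch counts are all hypothetical. Stage 1 updates only `v`; stage 2 updates `W` and `v` together.

```python
import copy
import random


def forward(W, v, x):
    """Encoder h = W x (toy stand-in for BERT), then head y_hat = v . h."""
    h = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    return h, sum(vi * hi for vi, hi in zip(v, h))


def mean_loss(W, v, data):
    """Mean squared error over the whole data set."""
    return sum((forward(W, v, x)[1] - y) ** 2 for x, y in data) / len(data)


def train(W, v, data, epochs, lr, train_encoder):
    """Per-example SGD, in place. W is updated only when train_encoder is True."""
    for _ in range(epochs):
        for x, y in data:
            h, y_hat = forward(W, v, x)
            err = 2.0 * (y_hat - y)  # d(loss)/d(y_hat)
            if train_encoder:
                # d(loss)/dW[i][j] = err * v[i] * x[j], using v before its update
                for i, row in enumerate(W):
                    for j, xj in enumerate(x):
                        row[j] -= lr * err * v[i] * xj
            # d(loss)/dv[i] = err * h[i], with h computed before any update
            for i, hi in enumerate(h):
                v[i] -= lr * err * hi


def two_stage_demo():
    rng = random.Random(0)
    target = [1.0, -1.0, 0.5]  # hypothetical task: y = target . x
    data = []
    for _ in range(40):
        x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        data.append((x, sum(t * xj for t, xj in zip(target, x))))

    W = [[0.5, 0.1, -0.2], [-0.1, 0.4, 0.3]]  # "pre-trained" encoder weights (made up)
    v = [0.0, 0.0]                             # freshly initialised head
    W_init = copy.deepcopy(W)
    loss_init = mean_loss(W, v, data)

    # Stage 1: encoder frozen, only the head learns.
    train(W, v, data, epochs=30, lr=0.05, train_encoder=False)
    encoder_unchanged = (W == W_init)
    loss_stage1 = mean_loss(W, v, data)

    # Stage 2: unfreeze and fine-tune encoder and head jointly.
    train(W, v, data, epochs=50, lr=0.02, train_encoder=True)
    encoder_changed = (W != W_init)
    loss_stage2 = mean_loss(W, v, data)

    return encoder_unchanged, encoder_changed, loss_init, loss_stage1, loss_stage2


if __name__ == "__main__":
    print(two_stage_demo())
```

In this toy setting the frozen encoder cannot represent the target exactly, so the head alone plateaus and the joint stage is what closes the gap. Whether the same happens with BERT on real tasks is the empirical question the paper addresses; the sketch only shows the mechanics of the schedule.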

If this is right

The staged method improves results over either freezing BERT entirely or fine-tuning from the outset on the tested tasks.
A light-weight task head combined with the two-stage schedule can replace more complex heads while still raising accuracy.
The same initial-freeze-then-full-tune sequence applies across semantic similarity, sequence labeling, and classification tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The early frozen phase may stabilize learning of task-specific features before the model parameters are allowed to shift.
The procedure could be tested on other pre-trained models such as RoBERTa or GPT variants to check whether similar staged gains appear.
If the gains hold, the method offers a low-cost way to adapt large models without redesigning task architectures.

Load-bearing premise

The reported accuracy gains are produced by the two-stage freezing-then-fine-tuning sequence itself rather than by differences in hyperparameters, seeds, or baseline choices.

What would settle it

Re-running the three tasks with the same data splits, seeds, and single-stage fine-tuning or frozen-head baselines and obtaining equal or higher accuracy would show the two-stage method does not drive the gains.
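
A matched-seed comparison of this kind can be summarised with a paired analysis. The sketch below is an illustration, not the paper's protocol: the function name `paired_seed_comparison` and the choice of an exact sign-flip permutation test are assumptions, and any accuracies passed to it would have to come from actual re-runs.

```python
import statistics
from itertools import product


def paired_seed_comparison(baseline_acc, two_stage_acc):
    """Compare two methods run on the same seeds and data splits.

    Both arguments are per-seed accuracies in matching order.
    """
    if len(baseline_acc) != len(two_stage_acc) or len(baseline_acc) < 2:
        raise ValueError("need two equal-length lists with at least two seeds")

    diffs = [t - b for b, t in zip(baseline_acc, two_stage_acc)]
    observed = abs(sum(diffs))

    # Exact two-sided sign-flip test: under the null hypothesis of no
    # difference, each per-seed difference is equally likely to be + or -.
    # Enumerates 2**n sign patterns, so it is only practical for small n.
    n = len(diffs)
    as_extreme = sum(
        1
        for signs in product((1.0, -1.0), repeat=n)
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12
    )

    return {
        "baseline_mean": statistics.mean(baseline_acc),
        "baseline_std": statistics.stdev(baseline_acc),
        "two_stage_mean": statistics.mean(two_stage_acc),
        "two_stage_std": statistics.stdev(two_stage_acc),
        "mean_gain": statistics.mean(diffs),
        "p_value": as_extreme / 2 ** n,
    }
```

With n seeds the smallest two-sided p-value this test can return is 2 / 2^n, so five seeds bottom out at 0.0625. A handful of runs therefore cannot by itself establish significance at the 5% level.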

Figures

Figures reproduced from arXiv: 1907.05338 by Chunye Wang, Haibo Su, Jupeng Ding, Kailin Ji, Ran Wang.

Original abstract

The introduction of pre-trained language models has revolutionized natural language research communities. However, researchers still know relatively little regarding their theoretical and empirical properties. In this regard, Peters et al. perform several experiments which demonstrate that it is better to adapt BERT with a light-weight task-specific head, rather than building a complex one on top of the pre-trained language model, and freeze the parameters in the said language model. However, there is another option to adopt. In this paper, we propose a new adaptation method which we first train the task model with the BERT parameters frozen and then fine-tune the entire model together. Our experimental results show that our model adaptation method can achieve 4.7% accuracy improvement in semantic similarity task, 0.99% accuracy improvement in sequence labeling task and 0.72% accuracy improvement in the text classification task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a two-stage adaptation procedure for pre-trained models such as BERT: first train a task-specific head with the language-model parameters frozen, then jointly fine-tune the entire model. It reports that this schedule produces accuracy gains of 4.7% on a semantic-similarity task, 0.99% on a sequence-labeling task, and 0.72% on a text-classification task relative to the single-stage baselines of Peters et al.

Significance. If the reported deltas can be shown to arise from the two-stage schedule rather than from uncontrolled differences in hyper-parameters, seeds, or baseline implementations, the result would supply a simple, low-cost recipe that combines the stability of frozen adaptation with the flexibility of full fine-tuning. At present the manuscript supplies no evidence that permits this attribution.

major comments (1)

[Abstract / Results] Abstract and experimental-results section: the central performance claims (4.7%, 0.99%, 0.72% absolute accuracy gains) are presented without any statement of the number of runs, random seeds, standard deviations, statistical tests, exact dataset splits, learning-rate schedules, or the precise re-implementation of the Peters et al. baselines. Because every other experimental factor must be held identical for the attribution to the two-stage schedule to be valid, the absence of these controls renders the headline claim unevaluable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the importance of rigorous experimental reporting. The manuscript as submitted does not include the requested details on runs, seeds, and baseline implementations. We will revise the experimental section to provide this information, allowing for a clearer attribution of the observed gains to the proposed two-stage adaptation procedure.

Point-by-point responses

Referee: [Abstract / Results] Abstract and experimental-results section: the central performance claims (4.7%, 0.99%, 0.72% absolute accuracy gains) are presented without any statement of the number of runs, random seeds, standard deviations, statistical tests, exact dataset splits, learning-rate schedules, or the precise re-implementation of the Peters et al. baselines. Because every other experimental factor must be held identical for the attribution to the two-stage schedule to be valid, the absence of these controls renders the headline claim unevaluable.

Authors: We concur that without these controls, it is difficult to attribute the gains solely to the two-stage schedule. In the revised version of the manuscript, we will include: the number of independent runs and random seeds used, standard deviations of the results, any statistical tests performed, the exact dataset splits, the learning-rate schedules employed, and a precise description of how the Peters et al. baselines were re-implemented. This will strengthen the claim that the improvements come from the adaptation method rather than other factors.

revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of accuracy deltas

Full rationale

The paper proposes a two-stage adaptation procedure for BERT and reports accuracy gains on three tasks (4.7%, 0.99%, 0.72%). No equations, derivations, fitted parameters, or self-citations appear in the provided text. The central claim is an empirical comparison whose validity rests on experimental controls rather than any mathematical reduction to inputs. This matches the reader's 0.0 assessment; no load-bearing step reduces by construction to the paper's own definitions or prior outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical proposal with no mathematical content; the abstract introduces no free parameters, background axioms, or new postulated entities.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 19 internal anchors

[1] Matthew Peters, Sebastian Ruder, and Noah A Smith. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, 2019.

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[3] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1:8, 2019.

[4] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

[5] Siva Reddy, Danqi Chen, and Christopher D Manning. CoQA: A conversational question answering challenge. arXiv preprint arXiv:1808.07042, 2018.

[6] Erik F Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050, 2003.

[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

[8] Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. Montreal neural machine translation systems for WMT’15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140, 2015.

[9] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.

[10] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019.

[11] Tomas Mikolov, Kai Chen, Gregory S Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. International Conference on Learning Representations, 2013.

[12] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305, 2017.

[13] Wanjia He, Weiran Wang, and Karen Livescu. Multi-view recurrent neural acoustic word embeddings. arXiv preprint arXiv:1611.04496, 2016.

[14] Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc V Le. Semi-supervised sequence modeling with cross-view training. arXiv preprint arXiv:1809.08370, 2018.

[15] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher G Clark, Kenton Lee, and Luke S Zettlemoyer. Deep contextualized word representations. North American Chapter of the Association for Computational Linguistics, 1:2227–2237, 2018.

[16] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. Meeting of the Association for Computational Linguistics, 1:328–339, 2018.

[17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[18] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[19] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075, 2015.

[20] Kaz Sato, Cliff Young, and David Patterson. An in-depth look at Google’s first tensor processing unit (TPU). Google Cloud Big Data and Machine Learning Blog, 12, 2017.

[21] Nelson F Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A Smith. Linguistic knowledge and transferability of contextual representations. arXiv preprint arXiv:1903.08855, 2019.

[22] Asa Cooper Stickland and Iain Murray. BERT and PALs: Projected attention layers for efficient adaptation in multi-task learning. arXiv preprint arXiv:1902.02671, 2019.

[23] Hu Xu, Bing Liu, Lei Shu, and Philip S Yu. BERT post-training for review reading comprehension and aspect-based sentiment analysis. arXiv preprint arXiv:1904.02232, 2019.

[24] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune BERT for text classification? arXiv preprint arXiv:1905.05583, 2019.

[25] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.

[26] Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017.

[27] William B Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.

[28] Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.

[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[30] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[31] Jason PC Chiu and Eric Nichols. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370, 2016.

[32] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360, 2016.

[33] Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354, 2016.

[34] Hoa T Le, Christophe Cerisara, and Alexandre Denis. Do convolutional networks need to be deep for text classification? In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[35] Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass. Highway long short-term memory RNNs for distant speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5755–5759. IEEE, 2016.

[36] Jonas Mueller and Aditya Thyagarajan. Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[37] Shuohang Wang and Jing Jiang. A compare-aggregate model for matching text sequences. arXiv preprint arXiv:1611.01747, 2016.

[38] Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. arXiv preprint arXiv:1609.06038, 2016.

[39] Zhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural language sentences. arXiv preprint arXiv:1702.03814, 2017.

[40] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
