To Tune or Not To Tune? How About the Best of Both Worlds?
Pith reviewed 2026-05-25 00:43 UTC · model grok-4.3
The pith
A two-stage adaptation method first trains a task head on top of frozen BERT, then jointly fine-tunes the full model, to raise accuracy on downstream tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training first with frozen BERT parameters and then fine-tuning the entire model together produces higher accuracy than standard single-stage adaptation on semantic similarity, sequence labeling, and text classification tasks.
What carries the argument
The two-stage adaptation procedure that starts with frozen pre-trained parameters before unfreezing them for joint optimization.
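The schedule can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: `ToyModel`, its two scalar weights, the learning rates, and the synthetic data stand in for BERT, the task head, and a real dataset, and none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    w_enc: float   # stand-in for the pre-trained encoder (BERT) parameters
    w_head: float  # stand-in for the light-weight task head

    def predict(self, x):
        return self.w_head * (self.w_enc * x)


def mse(model, data):
    return sum((model.predict(x) - y) ** 2 for x, y in data) / len(data)


def train(model, data, epochs, lr, train_encoder):
    """Full-batch gradient descent; the encoder moves only if train_encoder is True."""
    n = len(data)
    for _ in range(epochs):
        g_head = 0.0
        g_enc = 0.0
        for x, y in data:
            err = model.predict(x) - y
            g_head += 2 * err * model.w_enc * x / n
            g_enc += 2 * err * model.w_head * x / n
        model.w_head -= lr * g_head
        if train_encoder:
            model.w_enc -= lr * g_enc


def two_stage_adapt(model, data, epochs=50, lr_head=0.05, lr_joint=0.01):
    # Stage 1: encoder frozen, only the head is trained.
    train(model, data, epochs, lr_head, train_encoder=False)
    enc_after_stage1 = model.w_enc
    # Stage 2: encoder unfrozen, head and encoder are optimized jointly.
    train(model, data, epochs, lr_joint, train_encoder=True)
    return enc_after_stage1


# Synthetic data (y = 3x), purely illustrative.
data = [(x, 3.0 * x) for x in (-2.0, -1.0, 0.5, 1.0, 2.0)]
model = ToyModel(w_enc=1.0, w_head=0.0)
initial_loss = mse(model, data)
enc_after_stage1 = two_stage_adapt(model, data)
final_loss = mse(model, data)
```

In a real framework the same idea would usually be expressed by disabling gradient updates for the encoder during the first stage and re-enabling them for the second; the toy above only shows the control flow.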
If this is right
- The staged method improves results over either freezing BERT entirely or fine-tuning from the outset on the tested tasks.
- A light-weight task head combined with the two-stage schedule can replace more complex heads while still raising accuracy.
- The same initial-freeze-then-full-tune sequence applies across semantic similarity, sequence labeling, and classification tasks.
Where Pith is reading between the lines
- The early frozen phase may stabilize learning of task-specific features before the model parameters are allowed to shift.
- The procedure could be tested on other pre-trained models such as RoBERTa or GPT variants to check whether similar staged gains appear.
- If the gains hold, the method offers a low-cost way to adapt large models without redesigning task architectures.
Load-bearing premise
The reported accuracy gains are produced by the two-stage freezing-then-fine-tuning sequence itself rather than by differences in hyperparameters, seeds, or baseline choices.
What would settle it
Re-running the three tasks with identical data splits and seeds, and finding that the single-stage fine-tuning or frozen-BERT baselines match or exceed the two-stage accuracy, would show that the two-stage method does not drive the gains.
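One way to set up such a matched comparison is sketched below. It is only an assumption about how the control could be organized: `RunConfig`, its field values, and `stub_train_and_eval` are hypothetical names introduced here, and the stub returns a seeded placeholder number rather than any real accuracy.

```python
import random
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    schedule: str                   # "single_stage" or "two_stage"
    seed: int
    learning_rate: float = 2e-5     # hypothetical value, not from the paper
    epochs: int = 3                 # hypothetical value, not from the paper
    split_id: str = "fixed-split"   # hypothetical label for a frozen data split


def matched_pairs(base, seeds):
    """Build one (single-stage, two-stage) pair per seed; only `schedule` differs."""
    return [
        (replace(base, schedule="single_stage", seed=s),
         replace(base, schedule="two_stage", seed=s))
        for s in seeds
    ]


def differing_fields(a, b):
    """List the config fields on which two runs disagree."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])


def stub_train_and_eval(cfg):
    # Placeholder for a real training run: returns a seeded pseudo-score,
    # NOT an accuracy from the paper or from any model.
    return random.Random(cfg.seed).random()


def run_all(pairs, train_and_eval):
    """Run both arms for every seed and keep the results paired by seed."""
    return [
        (single.seed, train_and_eval(single), train_and_eval(staged))
        for single, staged in pairs
    ]


seeds = [11, 22, 33, 44, 55]
base = RunConfig(schedule="single_stage", seed=0)
pairs = matched_pairs(base, seeds)
results = run_all(pairs, stub_train_and_eval)
```

Checking `differing_fields` on every pair before training makes it explicit that the schedule is the only factor that varies between the two arms.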
Original abstract
The introduction of pre-trained language models has revolutionized natural language research communities. However, researchers still know relatively little regarding their theoretical and empirical properties. In this regard, Peters et al. perform several experiments which demonstrate that it is better to adapt BERT with a light-weight task-specific head, rather than building a complex one on top of the pre-trained language model, and freeze the parameters in the said language model. However, there is another option to adopt. In this paper, we propose a new adaptation method in which we first train the task model with the BERT parameters frozen and then fine-tune the entire model together. Our experimental results show that our model adaptation method can achieve a 4.7% accuracy improvement on the semantic similarity task, a 0.99% accuracy improvement on the sequence labeling task, and a 0.72% accuracy improvement on the text classification task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage adaptation procedure for pre-trained models such as BERT: first train a task-specific head with the language-model parameters frozen, then jointly fine-tune the entire model. It reports that this schedule produces accuracy gains of 4.7% on a semantic-similarity task, 0.99% on a sequence-labeling task, and 0.72% on a text-classification task relative to the single-stage baselines of Peters et al.
Significance. If the reported deltas can be shown to arise from the two-stage schedule rather than from uncontrolled differences in hyper-parameters, seeds, or baseline implementations, the result would supply a simple, low-cost recipe that combines the stability of frozen adaptation with the flexibility of full fine-tuning. At present the manuscript supplies no evidence that permits this attribution.
major comments (1)
- [Abstract / Results] Abstract and experimental-results section: the central performance claims (4.7%, 0.99%, 0.72% absolute accuracy gains) are presented without any statement of the number of runs, random seeds, standard deviations, statistical tests, exact dataset splits, learning-rate schedules, or the precise re-implementation of the Peters et al. baselines. Because every other experimental factor must be held identical for the attribution to the two-stage schedule to be valid, the absence of these controls renders the headline claim unevaluable.
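The kind of reporting the comment asks for can be sketched as follows: per-seed paired differences, their mean and standard deviation, and an exact sign-flip permutation test. The accuracy lists are made-up placeholders chosen only to exercise the code; they are not results from the paper, and the choice of test is an assumption rather than anything the manuscript specifies.

```python
from itertools import product
from statistics import mean, stdev


def paired_summary(baseline, treatment):
    """Mean and sample standard deviation of per-seed accuracy differences."""
    diffs = [t - b for b, t in zip(baseline, treatment)]
    return {"diffs": diffs, "mean_diff": mean(diffs), "std_diff": stdev(diffs)}


def sign_flip_p_value(diffs):
    """Exact two-sided paired permutation test.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so every sign pattern is enumerated and compared with the observed mean.
    """
    observed = abs(mean(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = mean([s * d for s, d in zip(signs, diffs)])
        if abs(flipped) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# PLACEHOLDER accuracies for five seeds; illustrative only, not from the paper.
baseline_acc = [0.80, 0.81, 0.79, 0.82, 0.80]
two_stage_acc = [0.82, 0.82, 0.81, 0.83, 0.82]

summary = paired_summary(baseline_acc, two_stage_acc)
p_value = sign_flip_p_value(summary["diffs"])
```

With only five seeds the smallest achievable two-sided p-value is 2/32, which is one reason the number of runs matters as much as the size of the reported delta.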
Simulated Author's Rebuttal
We thank the referee for highlighting the importance of rigorous experimental reporting. The manuscript as submitted does not include the requested details on runs, seeds, and baseline implementations. We will revise the experimental section to provide this information, allowing for a clearer attribution of the observed gains to the proposed two-stage adaptation procedure.
Point-by-point responses
- Referee: [Abstract / Results] Abstract and experimental-results section: the central performance claims (4.7%, 0.99%, 0.72% absolute accuracy gains) are presented without any statement of the number of runs, random seeds, standard deviations, statistical tests, exact dataset splits, learning-rate schedules, or the precise re-implementation of the Peters et al. baselines. Because every other experimental factor must be held identical for the attribution to the two-stage schedule to be valid, the absence of these controls renders the headline claim unevaluable.
  Authors: We concur that without these controls, it is difficult to attribute the gains solely to the two-stage schedule. In the revised version of the manuscript, we will include: the number of independent runs and random seeds used, standard deviations of the results, any statistical tests performed, the exact dataset splits, the learning-rate schedules employed, and a precise description of how the Peters et al. baselines were re-implemented. This will strengthen the claim that the improvements come from the adaptation method rather than other factors.
  Revision: yes
Circularity Check
No circularity: purely empirical reporting of accuracy deltas
Full rationale
The paper proposes a two-stage adaptation procedure for BERT and reports accuracy gains on three tasks (4.7%, 0.99%, 0.72%). No equations, derivations, fitted parameters, or self-citations appear in the provided text. The central claim is an empirical comparison whose validity rests on experimental controls rather than any mathematical reduction to inputs. This matches the reader's 0.0 assessment; no load-bearing step reduces by construction to the paper's own definitions or prior outputs.
Reference graph
Works this paper leans on
- [1] Matthew Peters, Sebastian Ruder, and Noah A Smith. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, 2019.
- [2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [3] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1:8, 2019.
- [4] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
- [5] Siva Reddy, Danqi Chen, and Christopher D Manning. CoQA: A conversational question answering challenge. arXiv preprint arXiv:1808.07042, 2018.
- [6] Erik F Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050, 2003.
- [7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
- [8] Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. Montreal neural machine translation systems for WMT’15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140, 2015.
- [9] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.
- [10] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019.
- [11] Tomas Mikolov, Kai Chen, Gregory S Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. International Conference on Learning Representations, 2013.
- [12] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305, 2017.
- [13] Wanjia He, Weiran Wang, and Karen Livescu. Multi-view recurrent neural acoustic word embeddings. arXiv preprint arXiv:1611.04496, 2016.
- [14] Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc V Le. Semi-supervised sequence modeling with cross-view training. arXiv preprint arXiv:1809.08370, 2018.
- [15] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher G Clark, Kenton Lee, and Luke S Zettlemoyer. Deep contextualized word representations. North American Chapter of the Association for Computational Linguistics, 1:2227–2237, 2018.
- [16] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. Meeting of the Association for Computational Linguistics, 1:328–339, 2018.
- [17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- [18] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [19] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075, 2015.
- [20] Kaz Sato, Cliff Young, and David Patterson. An in-depth look at Google’s first tensor processing unit (TPU). Google Cloud Big Data and Machine Learning Blog, 12, 2017.
- [21] Nelson F Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A Smith. Linguistic knowledge and transferability of contextual representations. arXiv preprint arXiv:1903.08855, 2019.
- [22] Asa Cooper Stickland and Iain Murray. BERT and PALs: Projected attention layers for efficient adaptation in multi-task learning. arXiv preprint arXiv:1902.02671, 2019.
- [23] Hu Xu, Bing Liu, Lei Shu, and Philip S Yu. BERT post-training for review reading comprehension and aspect-based sentiment analysis. arXiv preprint arXiv:1904.02232, 2019.
- [24] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune BERT for text classification? arXiv preprint arXiv:1905.05583, 2019.
- [25] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.
- [26] Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017.
- [27] William B Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.
- [28] Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.
- [29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
- [30] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
- [31] Jason PC Chiu and Eric Nichols. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370, 2016.
- [32] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360, 2016.
- [33] Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354, 2016.
- [34] Hoa T Le, Christophe Cerisara, and Alexandre Denis. Do convolutional networks need to be deep for text classification? In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- [35] Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass. Highway long short-term memory RNNs for distant speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5755–5759. IEEE, 2016.
- [36] Jonas Mueller and Aditya Thyagarajan. Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- [37] Shuohang Wang and Jing Jiang. A compare-aggregate model for matching text sequences. arXiv preprint arXiv:1611.01747, 2016.
- [38] Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. arXiv preprint arXiv:1609.06038, 2016.
- [39] Zhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural language sentences. arXiv preprint arXiv:1702.03814, 2017.
- [40] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.