
arxiv: 1907.05338 · v1 · pith:UKRDTWNAnew · submitted 2019-07-09 · 💻 cs.CL · cs.LG

To Tune or Not To Tune? How About the Best of Both Worlds?

Pith reviewed 2026-05-25 00:43 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords BERT adaptation · fine-tuning · pre-trained language models · semantic similarity · sequence labeling · text classification · model tuning

The pith

A two-stage adaptation method first freezes BERT then jointly fine-tunes the full model to raise accuracy on downstream tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to adapt pre-trained language models like BERT for specific tasks. It introduces a procedure that begins by training a task model while keeping BERT parameters frozen, then continues by fine-tuning all parameters together. Experiments on three tasks report accuracy gains of 4.7 percent on semantic similarity, 0.99 percent on sequence labeling, and 0.72 percent on text classification. The approach is presented as a way to obtain the benefits of both stable early adaptation and later full optimization.

Core claim

Training first with frozen BERT parameters and then fine-tuning the entire model together produces higher accuracy than standard single-stage adaptation on semantic similarity, sequence labeling, and text classification tasks.

What carries the argument

The two-stage adaptation procedure that starts with frozen pre-trained parameters before unfreezing them for joint optimization.
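
As a rough illustration of the schedule described above, the toy sketch below swaps BERT for a single scalar "encoder" weight and the task head for a single scalar "head" weight. Both are hypothetical stand-ins, and the learning rate, step counts, and data are assumptions chosen for readability rather than values from the paper. Stage 1 updates only the head; stage 2 updates both weights.

```python
# Toy two-stage schedule: the scalar `enc` stands in for pre-trained BERT
# weights and `head` for the task head. Both are hypothetical stand-ins.

def mse(enc, head, data):
    """Mean squared error of the model y_hat = head * enc * x."""
    return sum((head * enc * x - y) ** 2 for x, y in data) / len(data)

def train(enc, head, data, steps, lr, tune_encoder):
    """Plain gradient descent; the encoder is updated only if tune_encoder."""
    n = len(data)
    for _ in range(steps):
        # Both gradients are computed from the current weights before updating.
        g_enc = sum(2 * (head * enc * x - y) * head * x for x, y in data) / n
        g_head = sum(2 * (head * enc * x - y) * enc * x for x, y in data) / n
        head -= lr * g_head
        if tune_encoder:
            enc -= lr * g_enc
    return enc, head

def two_stage(enc, head, data, frozen_steps=50, joint_steps=50, lr=0.05):
    """Stage 1: encoder frozen, head trained. Stage 2: joint fine-tuning."""
    enc, head = train(enc, head, data, frozen_steps, lr, tune_encoder=False)
    after_stage1 = (enc, head)
    enc, head = train(enc, head, data, joint_steps, lr, tune_encoder=True)
    return after_stage1, (enc, head)

# Toy regression target y = 3x; the "pre-trained" encoder starts at 1.0
# and the freshly initialised head at 0.0 (assumed values).
data = [(x, 3.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]
stage1, final = two_stage(enc=1.0, head=0.0, data=data)
```

In this sketch the encoder weight is untouched at the end of stage 1 and only moves once joint tuning begins, which is the property that separates the schedule from fine-tuning from the outset.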

If this is right

  • The staged method improves results over either freezing BERT entirely or fine-tuning from the outset on the tested tasks.
  • A light-weight task head combined with the two-stage schedule can replace more complex heads while still raising accuracy.
  • The same initial-freeze-then-full-tune sequence applies across semantic similarity, sequence labeling, and classification tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The early frozen phase may stabilize learning of task-specific features before the model parameters are allowed to shift.
  • The procedure could be tested on other pre-trained models such as RoBERTa or GPT variants to check whether similar staged gains appear.
  • If the gains hold, the method offers a low-cost way to adapt large models without redesigning task architectures.

Load-bearing premise

The reported accuracy gains are produced by the two-stage freezing-then-fine-tuning sequence itself rather than by differences in hyperparameters, seeds, or baseline choices.

What would settle it

Re-running the three tasks with the same data splits, seeds, and single-stage fine-tuning or frozen-head baselines and obtaining equal or higher accuracy would show the two-stage method does not drive the gains.
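
One way to picture the controlled comparison described above is a run grid in which grouped runs share every setting except the schedule. The sketch below is a hypothetical harness: the `RunConfig` fields, schedule names, task names, split identifier, and seed values are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass, replace
from itertools import product

@dataclass(frozen=True)
class RunConfig:
    task: str
    seed: int
    split_id: str   # identifier of a fixed train/dev/test split (assumed)
    schedule: str   # the only field allowed to differ within a group

# Hypothetical names for the three adaptation schedules being compared.
SCHEDULES = ("frozen_only", "fine_tune_only", "frozen_then_fine_tune")

def paired_runs(tasks, seeds, split_id="fixed-v1"):
    """Yield one group per (task, seed); each group varies only `schedule`."""
    for task, seed in product(tasks, seeds):
        base = RunConfig(task, seed, split_id, SCHEDULES[0])
        yield [replace(base, schedule=s) for s in SCHEDULES]

def differs_only_in_schedule(group):
    """Check that every run in a group matches on all non-schedule fields."""
    keys = {(r.task, r.seed, r.split_id) for r in group}
    return len(keys) == 1 and len({r.schedule for r in group}) == len(group)

groups = list(paired_runs(
    tasks=("semantic_similarity", "sequence_labeling", "text_classification"),
    seeds=(0, 1, 2, 3, 4),
))
```

Grouping runs this way makes it easy to verify, before any training starts, that a difference in accuracy within a group can only come from the schedule.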

Figures

Figures reproduced from arXiv: 1907.05338 by Chunye Wang, Haibo Su, Jupeng Ding, Kailin Ji, Ran Wang.

Figure 1: The Difference Between BERT and Open-GPT, extracted from Devlin et al. [2], Figure 1

Original abstract

The introduction of pre-trained language models has revolutionized natural language research communities. However, researchers still know relatively little regarding their theoretical and empirical properties. In this regard, Peters et al. perform several experiments which demonstrate that it is better to adapt BERT with a light-weight task-specific head, rather than building a complex one on top of the pre-trained language model, and freeze the parameters in the said language model. However, there is another option to adopt. In this paper, we propose a new adaptation method in which we first train the task model with the BERT parameters frozen and then fine-tune the entire model together. Our experimental results show that our model adaptation method can achieve a 4.7% accuracy improvement on the semantic similarity task, a 0.99% accuracy improvement on the sequence labeling task, and a 0.72% accuracy improvement on the text classification task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a two-stage adaptation procedure for pre-trained models such as BERT: first train a task-specific head with the language-model parameters frozen, then jointly fine-tune the entire model. It reports that this schedule produces accuracy gains of 4.7% on a semantic-similarity task, 0.99% on a sequence-labeling task, and 0.72% on a text-classification task relative to the single-stage baselines of Peters et al.

Significance. If the reported deltas can be shown to arise from the two-stage schedule rather than from uncontrolled differences in hyper-parameters, seeds, or baseline implementations, the result would supply a simple, low-cost recipe that combines the stability of frozen adaptation with the flexibility of full fine-tuning. At present the manuscript supplies no evidence that permits this attribution.

major comments (1)
  1. [Abstract / Results] Abstract and experimental-results section: the central performance claims (4.7%, 0.99%, 0.72% absolute accuracy gains) are presented without any statement of the number of runs, random seeds, standard deviations, statistical tests, exact dataset splits, learning-rate schedules, or the precise re-implementation of the Peters et al. baselines. Because every other experimental factor must be held identical for the attribution to the two-stage schedule to be valid, the absence of these controls renders the headline claim unevaluable.
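
The reporting the referee asks for could look like the following sketch, which summarises per-seed accuracies and computes a paired t statistic over shared seeds. The accuracy values are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, sample standard deviation) over independent runs."""
    return mean(scores), stdev(scores)

def paired_t(baseline, treatment):
    """Paired t statistic and degrees of freedom for per-seed differences."""
    if len(baseline) != len(treatment):
        raise ValueError("runs must be paired seed by seed")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

# Placeholder accuracies for five shared seeds (illustrative only).
single_stage = [0.80, 0.82, 0.81, 0.79, 0.83]
two_stage = [0.82, 0.83, 0.84, 0.80, 0.86]

base_mean, base_std = summarize(single_stage)
new_mean, new_std = summarize(two_stage)
t_stat, dof = paired_t(single_stage, two_stage)
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom; pairing by seed is what lets a small but consistent gain show up despite run-to-run variance.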

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the importance of rigorous experimental reporting. The manuscript as submitted does not include the requested details on runs, seeds, and baseline implementations. We will revise the experimental section to provide this information, allowing for a clearer attribution of the observed gains to the proposed two-stage adaptation procedure.

Point-by-point responses
  1. Referee: [Abstract / Results] Abstract and experimental-results section: the central performance claims (4.7%, 0.99%, 0.72% absolute accuracy gains) are presented without any statement of the number of runs, random seeds, standard deviations, statistical tests, exact dataset splits, learning-rate schedules, or the precise re-implementation of the Peters et al. baselines. Because every other experimental factor must be held identical for the attribution to the two-stage schedule to be valid, the absence of these controls renders the headline claim unevaluable.

    Authors: We concur that without these controls, it is difficult to attribute the gains solely to the two-stage schedule. In the revised version of the manuscript, we will include: the number of independent runs and random seeds used, standard deviations of the results, any statistical tests performed, the exact dataset splits, the learning-rate schedules employed, and a precise description of how the Peters et al. baselines were re-implemented. This will strengthen the claim that the improvements come from the adaptation method rather than other factors.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of accuracy deltas

full rationale

The paper proposes a two-stage adaptation procedure for BERT and reports accuracy gains on three tasks (4.7%, 0.99%, 0.72%). No equations, derivations, fitted parameters, or self-citations appear in the provided text. The central claim is an empirical comparison whose validity rests on experimental controls rather than any mathematical reduction to inputs. This matches the reader's 0.0 assessment; no load-bearing step reduces by construction to the paper's own definitions or prior outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical proposal with no mathematical content; the abstract introduces no free parameters, background axioms, or new postulated entities.



Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 19 internal anchors

  1. [1]

    To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks

    Matthew Peters, Sebastian Ruder, and Noah A Smith. To tune or not to tune? adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, 2019

  2. [2]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  3. [3]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1:8, 2019

  4. [4]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016

  5. [5]

    CoQA: A Conversational Question Answering Challenge

    Siva Reddy, Danqi Chen, and Christopher D Manning. Coqa: A conversational question answering challenge. arXiv preprint arXiv:1808.07042, 2018

  6. [6]

    Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition

    Erik F Sang and Fien De Meulder. Introduction to the conll-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050, 2003

  7. [7]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

  8. [8]

    Montreal neural machine translation systems for wmt’15

    Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. Montreal neural machine translation systems for wmt’15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140, 2015

  9. [9]

    ERNIE: Enhanced Representation through Knowledge Integration

    Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019

  10. [10]

    Multi-Task Deep Neural Networks for Natural Language Understanding

    Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019

  11. [11]

    Efficient estimation of word representations in vector space

    Tomas Mikolov, Kai Chen, Gregory S Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. international conference on learning representations, 2013

  12. [12]

    Learned in translation: Contextualized word vectors

    Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305, 2017

  13. [13]

    Multi-view Recurrent Neural Acoustic Word Embeddings

    Wanjia He, Weiran Wang, and Karen Livescu. Multi-view recurrent neural acoustic word embeddings. arXiv preprint arXiv:1611.04496, 2016

  14. [14]

    Semi-Supervised Sequence Modeling with Cross-View Training

    Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc V Le. Semi-supervised sequence modeling with cross-view training. arXiv preprint arXiv:1809.08370, 2018

  15. [15]

    Deep contextualized word representations

    Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher G Clark, Kenton Lee, and Luke S Zettlemoyer. Deep contextualized word representations. north american chapter of the association for computational linguistics, 1:2227–2237, 2018

  16. [16]

    Universal language model fine-tuning for text classification

    Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. meeting of the association for computational linguistics, 1:328–339, 2018

  17. [17]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017

  18. [18]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

  19. [19]

    Gated feedback recurrent neural networks

    Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075, 2015

  20. [20]

    An in-depth look at google’s first tensor processing unit (tpu)

    Kaz Sato, Cliff Young, and David Patterson. An in-depth look at google’s first tensor processing unit (tpu). Google Cloud Big Data and Machine Learning Blog, 12, 2017

  21. [21]

    Linguistic Knowledge and Transferability of Contextual Representations

    Nelson F Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A Smith. Linguistic knowledge and transferability of contextual representations. arXiv preprint arXiv:1903.08855, 2019

  22. [22]

    BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning

    Asa Cooper Stickland and Iain Murray. Bert and pals: Projected attention layers for efficient adaptation in multi-task learning. arXiv preprint arXiv:1902.02671, 2019

  23. [23]

    BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis

    Hu Xu, Bing Liu, Lei Shu, and Philip S Yu. Bert post-training for review reading comprehension and aspect-based sentiment analysis. arXiv preprint arXiv:1904.02232, 2019

  24. [24]

    How to fine-tune bert for text classification?

    Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune bert for text classification? arXiv preprint arXiv:1905.05583, 2019

  25. [25]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013

  26. [26]

    A broad-coverage challenge corpus for sentence understanding through inference

    Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017

  27. [27]

    Automatically constructing a corpus of sentential paraphrases

    William B Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005

  28. [28]

    SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation

    Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017

  29. [29]

    Delving deep into rectifiers: Surpassing human-level performance on imagenet classification

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015

  30. [30]

    Glove: Global vectors for word representation

    Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014

  31. [31]

    Named entity recognition with bidirectional lstm-cnns

    Jason PC Chiu and Eric Nichols. Named entity recognition with bidirectional lstm-cnns. Transactions of the Association for Computational Linguistics, 4:357–370, 2016

  32. [32]

    Neural Architectures for Named Entity Recognition

    Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360, 2016

  33. [33]

    End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF

    Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354, 2016

  34. [34]

    Do convolutional networks need to be deep for text classification?

    Hoa T Le, Christophe Cerisara, and Alexandre Denis. Do convolutional networks need to be deep for text classification? In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018

  35. [35]

    Highway long short-term memory rnns for distant speech recognition

    Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass. Highway long short-term memory rnns for distant speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5755–5759. IEEE, 2016

  36. [36]

    Siamese recurrent architectures for learning sentence similarity

    Jonas Mueller and Aditya Thyagarajan. Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence, 2016

  37. [37]

    A Compare-Aggregate Model for Matching Text Sequences

    Shuohang Wang and Jing Jiang. A compare-aggregate model for matching text sequences. arXiv preprint arXiv:1611.01747, 2016

  38. [38]

    Enhanced lstm for natural language inference

    Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced lstm for natural language inference. arXiv preprint arXiv:1609.06038, 2016

  39. [39]

    Bilateral Multi-Perspective Matching for Natural Language Sentences

    Zhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural language sentences. arXiv preprint arXiv:1702.03814, 2017

  40. [40]

    Deep visual domain adaptation: A survey

    Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018