From BERT to T5: A Study of Named Entity Recognition
Pith reviewed 2026-05-20 11:26 UTC · model grok-4.3
The pith
BERT with weighted cross-entropy and T5 with few-shot prompts show different strengths on named entity recognition under 7-class and 3-class tag schemes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By fine-tuning BERT with a weighted cross-entropy loss and adapting T5 through few-shot prompts, then evaluating both under the original 7-class tag scheme and a simplified 3-class scheme, the study produces comparative performance metrics, ablation results, and error breakdowns that illuminate the relative abilities of the two architectures for sequence labelling.
What carries the argument
The BERT classifier head trained with weighted cross-entropy together with the T5 sequence-to-sequence model driven by few-shot prompts, applied to named-entity tag sequences.
If this is right
- BERT may be the stronger default choice when the full set of entity classes must be distinguished.
- T5 may suit settings where label sets are simplified or training examples are scarce.
- Error patterns identified for each model can direct targeted fixes such as better boundary handling.
- Ablation outcomes indicate which hyperparameters matter most for stable results from each architecture.
Where Pith is reading between the lines
- The same comparison setup could be reused on other sequence-labelling problems such as chunking or semantic role labelling.
- Results point toward testing whether the observed trade-offs persist on domain-specific or low-resource NER datasets.
- Hybrid systems that route examples to the stronger model for a given tag granularity might improve overall accuracy.
Load-bearing premise
The chosen fine-tuning procedures, prompt formats, and validation strategies constitute representative and fair tests of each model's capacity for the NER task.
What would settle it
Repeating the experiments with alternate prompt formats or loss weights that reverse the observed performance ordering between BERT and T5 on the 7-class scheme would undermine the reported relative abilities.
Figures
read the original abstract
Named entity recognition (NER) has been one of the essential preliminary steps in modern NLP applications. This report focuses on implementing the NER task on finetuning two pretrained models: (i) an encoder-only model (BERT) with a simple classification head, and (ii) a sequence-to-sequence model (T5) with few-shot prompts. Under the original 7-class tag and 3-class simplified tag schemes, BERT is applied a weighted cross-entropy for training loss, and T5 is fine-tuned with two validation strategies. It also conducted an ablation study with different hyperparameters. Moreover, the related analysis provides valuable insights into common errors in BERT and the two models' performance. Based on a bunch of performance metrics, this report aims to compare the above two architectures and explore their abilities in the sequence labelling task, laying the groundwork for further practical use cases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares an encoder-only model (BERT with a token classification head and weighted cross-entropy loss) against a seq2seq model (T5 with few-shot prompting) on named entity recognition under both the original 7-class and a simplified 3-class tag scheme. It reports results from hyperparameter ablations, two validation strategies for T5, and error analysis to draw conclusions about the relative abilities of the two architectures for sequence labeling.
Significance. If the T5 prompting, decoding, and span-extraction procedures are demonstrated to be representative instantiations, the work supplies practical empirical guidance on when encoder-only versus encoder-decoder models may be preferable for NER, together with concrete error patterns that practitioners can use when selecting architectures. The inclusion of both tag-scheme variants and ablation results strengthens the utility of the comparison.
major comments (2)
- [T5 experimental setup] Section describing T5 experimental setup: the prompt templates, output format constraints, and procedure for parsing generated text into entity spans are not specified. Because the central claim rests on a fair head-to-head comparison of token-classification versus seq2seq capacity, the absence of these details leaves open the possibility that observed gaps are artifacts of prompt engineering or post-hoc parsing rather than architectural differences.
- [Results and discussion] Results and discussion sections: no quantitative comparison is provided between the two T5 validation strategies, nor is it shown whether either strategy approaches the performance obtainable by standard supervised fine-tuning of T5 on the same NER data. This information is load-bearing for the claim that the reported T5 numbers reflect seq2seq model capacity.
minor comments (2)
- [Abstract] The abstract states that T5 is 'fine-tuned with two validation strategies' while the title and later text refer to 'few-shot prompts'; the terminology should be made consistent throughout.
- [Results] Table or figure reporting main results: error bars or standard deviations across runs are not mentioned; adding them would strengthen the reliability of the reported performance differences.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript comparing BERT and T5 for named entity recognition. We address each major comment below and indicate the revisions we will make to strengthen the paper.
read point-by-point responses
-
Referee: [T5 experimental setup] Section describing T5 experimental setup: the prompt templates, output format constraints, and procedure for parsing generated text into entity spans are not specified. Because the central claim rests on a fair head-to-head comparison of token-classification versus seq2seq capacity, the absence of these details leaves open the possibility that observed gaps are artifacts of prompt engineering or post-hoc parsing rather than architectural differences.
Authors: We agree that these implementation details are essential for reproducibility and for supporting a fair architectural comparison. The prompt templates, output format constraints, and span-parsing procedure were described at a high level but not with sufficient specificity in the original submission. In the revised manuscript we will add a dedicated subsection that provides the exact prompt templates (including few-shot examples), the required output format (e.g., structured entity-type tags), and the deterministic parsing rules used to convert generated text into BIO-style spans. Concrete examples of prompts and model outputs will also be included. revision: yes
-
Referee: [Results and discussion] Results and discussion sections: no quantitative comparison is provided between the two T5 validation strategies, nor is it shown whether either strategy approaches the performance obtainable by standard supervised fine-tuning of T5 on the same NER data. This information is load-bearing for the claim that the reported T5 numbers reflect seq2seq model capacity.
Authors: We acknowledge that a side-by-side quantitative comparison of the two validation strategies is missing and will add it to the results section, reporting F1 scores, precision, and recall for both strategies under the 7-class and 3-class tag sets. Our study deliberately focuses on few-shot prompting rather than full supervised fine-tuning in order to examine seq2seq behavior under limited supervision. We did not run a standard full fine-tuning baseline due to computational constraints. In the revised discussion we will explicitly state this scope limitation, clarify that the reported T5 numbers reflect the few-shot regime, and note full fine-tuning as a natural direction for future work. revision: partial
Circularity Check
No circularity: empirical comparison of BERT and T5 on NER
full rationale
The paper reports an empirical study that fine-tunes BERT with weighted cross-entropy and T5 with few-shot prompts, evaluates both under 7-class and 3-class NER tag schemes, performs hyperparameter ablations, and analyzes errors. No derivation chain, first-principles prediction, or mathematical reduction is claimed or present. Performance metrics are obtained directly from held-out test data rather than being forced by construction from fitted parameters. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are referenced in the provided text. The central claims rest on observable experimental outcomes and are therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- hyperparameters for fine-tuning and prompting
axioms (1)
- domain assumption Pretrained BERT and T5 weights provide a suitable starting point for NER fine-tuning
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.RealityFromDistinctionreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
BERT with weighted cross-entropy and T5 with few-shot prompts are compared on NER under original 7-class and simplified 3-class tag schemes, with ablation and error analysis
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
-
[2]
IEEE Transactions on Knowledge and Data Engineering , volume =
Li, Jing and Sun, Aixin and Han, Jianglei and Li, Chenliang , title =. IEEE Transactions on Knowledge and Data Engineering , volume =. 2020 , doi =
work page 2020
-
[3]
Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , title =. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) , pages =. 2019 , url =
work page 2019
- [4]
-
[5]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://arxiv.org/abs/1810.04805 BERT : Pre-training of deep bidirectional transformers for language understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 4171--4...
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[6]
Dan Jurafsky and James H. Martin. 2021. Rule-based methods. In Speech and Language Processing, 3rd ed. draft edition, chapter 17.7.1. Available at: https://web.stanford.edu/ jurafsky/slp3/17.pdf
work page 2021
-
[7]
Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. 2020. https://doi.org/10.1109/TKDE.2020.2981314 A survey on deep learning for named entity recognition . IEEE Transactions on Knowledge and Data Engineering, 34(1):50--70
-
[8]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. https://arxiv.org/abs/1910.10683 Exploring the limits of transfer learning with a unified text-to-text transformer . Journal of Machine Learning Research, 21(140):1--67
work page internal anchor Pith review Pith/arXiv arXiv 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.