From BERT to T5: A Study of Named Entity Recognition

Mei Jia

arxiv: 2605.18462 · v1 · pith:PQ3KF3A5new · submitted 2026-05-18 · 💻 cs.CL

From BERT to T5: A Study of Named Entity Recognition

Mei Jia This is my paper

Pith reviewed 2026-05-20 11:26 UTC · model grok-4.3

classification 💻 cs.CL

keywords named entity recognitionBERTT5few-shot promptingweighted cross-entropysequence labellingablation studyerror analysis

0 comments

The pith

BERT with weighted cross-entropy and T5 with few-shot prompts show different strengths on named entity recognition under 7-class and 3-class tag schemes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to compare how an encoder-only model such as BERT and a sequence-to-sequence model such as T5 handle named entity recognition when each is adapted in a particular way. BERT receives a classification head and weighted cross-entropy loss, while T5 receives few-shot prompts and is tested with two validation strategies. Experiments run on both the original seven-class tags and a simplified three-class version, accompanied by hyperparameter ablations and error analysis. A sympathetic reader would care because the results can guide practical choices between these architectures for sequence-labelling problems in NLP pipelines.

Core claim

By fine-tuning BERT with a weighted cross-entropy loss and adapting T5 through few-shot prompts, then evaluating both under the original 7-class tag scheme and a simplified 3-class scheme, the study produces comparative performance metrics, ablation results, and error breakdowns that illuminate the relative abilities of the two architectures for sequence labelling.

What carries the argument

The BERT classifier head trained with weighted cross-entropy together with the T5 sequence-to-sequence model driven by few-shot prompts, applied to named-entity tag sequences.

If this is right

BERT may be the stronger default choice when the full set of entity classes must be distinguished.
T5 may suit settings where label sets are simplified or training examples are scarce.
Error patterns identified for each model can direct targeted fixes such as better boundary handling.
Ablation outcomes indicate which hyperparameters matter most for stable results from each architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same comparison setup could be reused on other sequence-labelling problems such as chunking or semantic role labelling.
Results point toward testing whether the observed trade-offs persist on domain-specific or low-resource NER datasets.
Hybrid systems that route examples to the stronger model for a given tag granularity might improve overall accuracy.

Load-bearing premise

The chosen fine-tuning procedures, prompt formats, and validation strategies constitute representative and fair tests of each model's capacity for the NER task.

What would settle it

Repeating the experiments with alternate prompt formats or loss weights that reverse the observed performance ordering between BERT and T5 on the 7-class scheme would undermine the reported relative abilities.

Figures

Figures reproduced from arXiv: 2605.18462 by Mei Jia.

**Figure 1.** Figure 1: Heatmaps of tag distribution of 7-class tags (left) and 3-class tags (right) across training (Train), validation [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Plots of training and validation loss (left) and validation metrics(right) over 20 epochs for the performance [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Plots of training and validation loss (left) and validation metrics(right) over 20 epochs for the performance [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Plots of training loss (left) and validation metrics(right) over 15 epochs for the performance of T5 on [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Plots of training loss (left) and validation metrics(right) over 13 epochs for the performance of T5 on [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Plot of top error examples of BERT in the Test dataset with full tags. [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Plot of top error examples of BERT in the OOD dataset with full tags. [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: Plot of error examples of BERT in the Test dataset with simple tags. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Plot of top error examples of BERT in the OOD dataset with simple tags. [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗

read the original abstract

Named entity recognition (NER) has been one of the essential preliminary steps in modern NLP applications. This report focuses on implementing the NER task on finetuning two pretrained models: (i) an encoder-only model (BERT) with a simple classification head, and (ii) a sequence-to-sequence model (T5) with few-shot prompts. Under the original 7-class tag and 3-class simplified tag schemes, BERT is applied a weighted cross-entropy for training loss, and T5 is fine-tuned with two validation strategies. It also conducted an ablation study with different hyperparameters. Moreover, the related analysis provides valuable insights into common errors in BERT and the two models' performance. Based on a bunch of performance metrics, this report aims to compare the above two architectures and explore their abilities in the sequence labelling task, laying the groundwork for further practical use cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A straightforward implementation comparison of BERT and T5 on NER that includes ablations and error analysis but stays incremental.

read the letter

The main takeaway is that this paper runs a side-by-side comparison of fine-tuned BERT with a classification head and weighted cross-entropy loss against T5 using few-shot prompts, testing both the original 7-class and simplified 3-class tag schemes on NER data, with hyperparameter ablations and some error breakdown. It is mostly a report on applying existing techniques rather than a new method or theory. What it does reasonably well is document the setups clearly enough to show practical differences and pull out common error patterns in the BERT runs, which can give implementers a sense of where token classification struggles on sequence labeling. The ablation section adds a bit of concrete data on hyperparameter sensitivity. The soft spots are the limited novelty and the T5 side. The prompting and output parsing for span extraction are described only at a high level in the abstract, so any performance gap could partly reflect how the generated text is decoded rather than pure model capacity; if the full text does not include the exact prompt templates and decoding rules, that weakens the relative-ability claims. The work is confirmatory and does not reduce to fitted parameters by construction, but it also does not ship new derivations or large-scale external validation. This is the sort of paper that might help a practitioner or student choose between these two architectures for a standard NER pipeline and get some error-analysis tips, but it will not shift broader understanding of encoder versus seq2seq models. The experimental approach looks standard and reproducible if code and data are available. I would not cite it in my own work. It is coherent on its own terms and shows honest engagement with the setups, so it deserves a serious referee for an applied or reproducibility-focused venue rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript compares an encoder-only model (BERT with a token classification head and weighted cross-entropy loss) against a seq2seq model (T5 with few-shot prompting) on named entity recognition under both the original 7-class and a simplified 3-class tag scheme. It reports results from hyperparameter ablations, two validation strategies for T5, and error analysis to draw conclusions about the relative abilities of the two architectures for sequence labeling.

Significance. If the T5 prompting, decoding, and span-extraction procedures are demonstrated to be representative instantiations, the work supplies practical empirical guidance on when encoder-only versus encoder-decoder models may be preferable for NER, together with concrete error patterns that practitioners can use when selecting architectures. The inclusion of both tag-scheme variants and ablation results strengthens the utility of the comparison.

major comments (2)

[T5 experimental setup] Section describing T5 experimental setup: the prompt templates, output format constraints, and procedure for parsing generated text into entity spans are not specified. Because the central claim rests on a fair head-to-head comparison of token-classification versus seq2seq capacity, the absence of these details leaves open the possibility that observed gaps are artifacts of prompt engineering or post-hoc parsing rather than architectural differences.
[Results and discussion] Results and discussion sections: no quantitative comparison is provided between the two T5 validation strategies, nor is it shown whether either strategy approaches the performance obtainable by standard supervised fine-tuning of T5 on the same NER data. This information is load-bearing for the claim that the reported T5 numbers reflect seq2seq model capacity.

minor comments (2)

[Abstract] The abstract states that T5 is 'fine-tuned with two validation strategies' while the title and later text refer to 'few-shot prompts'; the terminology should be made consistent throughout.
[Results] Table or figure reporting main results: error bars or standard deviations across runs are not mentioned; adding them would strengthen the reliability of the reported performance differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript comparing BERT and T5 for named entity recognition. We address each major comment below and indicate the revisions we will make to strengthen the paper.

read point-by-point responses

Referee: [T5 experimental setup] Section describing T5 experimental setup: the prompt templates, output format constraints, and procedure for parsing generated text into entity spans are not specified. Because the central claim rests on a fair head-to-head comparison of token-classification versus seq2seq capacity, the absence of these details leaves open the possibility that observed gaps are artifacts of prompt engineering or post-hoc parsing rather than architectural differences.

Authors: We agree that these implementation details are essential for reproducibility and for supporting a fair architectural comparison. The prompt templates, output format constraints, and span-parsing procedure were described at a high level but not with sufficient specificity in the original submission. In the revised manuscript we will add a dedicated subsection that provides the exact prompt templates (including few-shot examples), the required output format (e.g., structured entity-type tags), and the deterministic parsing rules used to convert generated text into BIO-style spans. Concrete examples of prompts and model outputs will also be included. revision: yes
Referee: [Results and discussion] Results and discussion sections: no quantitative comparison is provided between the two T5 validation strategies, nor is it shown whether either strategy approaches the performance obtainable by standard supervised fine-tuning of T5 on the same NER data. This information is load-bearing for the claim that the reported T5 numbers reflect seq2seq model capacity.

Authors: We acknowledge that a side-by-side quantitative comparison of the two validation strategies is missing and will add it to the results section, reporting F1 scores, precision, and recall for both strategies under the 7-class and 3-class tag sets. Our study deliberately focuses on few-shot prompting rather than full supervised fine-tuning in order to examine seq2seq behavior under limited supervision. We did not run a standard full fine-tuning baseline due to computational constraints. In the revised discussion we will explicitly state this scope limitation, clarify that the reported T5 numbers reflect the few-shot regime, and note full fine-tuning as a natural direction for future work. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparison of BERT and T5 on NER

full rationale

The paper reports an empirical study that fine-tunes BERT with weighted cross-entropy and T5 with few-shot prompts, evaluates both under 7-class and 3-class NER tag schemes, performs hyperparameter ablations, and analyzes errors. No derivation chain, first-principles prediction, or mathematical reduction is claimed or present. Performance metrics are obtained directly from held-out test data rather than being forced by construction from fitted parameters. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are referenced in the provided text. The central claims rest on observable experimental outcomes and are therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

No mathematical axioms or invented entities; the work rests on standard assumptions that pretrained transformer weights transfer to NER and that the chosen tag schemes and loss functions are appropriate.

free parameters (1)

hyperparameters for fine-tuning and prompting
Learning rates, batch sizes, prompt formats, and validation strategies are selected and ablated but not enumerated in the abstract.

axioms (1)

domain assumption Pretrained BERT and T5 weights provide a suitable starting point for NER fine-tuning
Implicit in the decision to fine-tune rather than train from scratch.

pith-pipeline@v0.9.0 · 5666 in / 1148 out tokens · 26166 ms · 2026-05-20T11:26:53.592337+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.RealityFromDistinction reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

BERT with weighted cross-entropy and T5 with few-shot prompts are compared on NER under original 7-class and simplified 3-class tag schemes, with ablation and error analysis

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · 2 internal anchors

[1]

, title =

Jurafsky, Dan and Martin, James H. , title =. Speech and Language Processing , edition =. 2021 , note =

work page 2021
[2]

IEEE Transactions on Knowledge and Data Engineering , volume =

Li, Jing and Sun, Aixin and Han, Jianglei and Li, Chenliang , title =. IEEE Transactions on Knowledge and Data Engineering , volume =. 2020 , doi =

work page 2020
[3]

Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) , pages =

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , title =. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) , pages =. 2019 , url =

work page 2019
[4]

, title =

Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. , title =. Journal of Machine Learning Research , volume =. 2020 , url =

work page 2020
[5]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://arxiv.org/abs/1810.04805 BERT : Pre-training of deep bidirectional transformers for language understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 4171--4...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[6]

Dan Jurafsky and James H. Martin. 2021. Rule-based methods. In Speech and Language Processing, 3rd ed. draft edition, chapter 17.7.1. Available at: https://web.stanford.edu/ jurafsky/slp3/17.pdf

work page 2021
[7]

Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. 2020. https://doi.org/10.1109/TKDE.2020.2981314 A survey on deep learning for named entity recognition . IEEE Transactions on Knowledge and Data Engineering, 34(1):50--70

work page doi:10.1109/tkde.2020.2981314 2020
[8]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. https://arxiv.org/abs/1910.10683 Exploring the limits of transfer learning with a unified text-to-text transformer . Journal of Machine Learning Research, 21(140):1--67

work page internal anchor Pith review Pith/arXiv arXiv 2020

[1] [1]

, title =

Jurafsky, Dan and Martin, James H. , title =. Speech and Language Processing , edition =. 2021 , note =

work page 2021

[2] [2]

IEEE Transactions on Knowledge and Data Engineering , volume =

Li, Jing and Sun, Aixin and Han, Jianglei and Li, Chenliang , title =. IEEE Transactions on Knowledge and Data Engineering , volume =. 2020 , doi =

work page 2020

[3] [3]

Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) , pages =

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , title =. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) , pages =. 2019 , url =

work page 2019

[4] [4]

, title =

Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. , title =. Journal of Machine Learning Research , volume =. 2020 , url =

work page 2020

[5] [5]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://arxiv.org/abs/1810.04805 BERT : Pre-training of deep bidirectional transformers for language understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 4171--4...

work page internal anchor Pith review Pith/arXiv arXiv 2019

[6] [6]

Dan Jurafsky and James H. Martin. 2021. Rule-based methods. In Speech and Language Processing, 3rd ed. draft edition, chapter 17.7.1. Available at: https://web.stanford.edu/ jurafsky/slp3/17.pdf

work page 2021

[7] [7]

Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. 2020. https://doi.org/10.1109/TKDE.2020.2981314 A survey on deep learning for named entity recognition . IEEE Transactions on Knowledge and Data Engineering, 34(1):50--70

work page doi:10.1109/tkde.2020.2981314 2020

[8] [8]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. https://arxiv.org/abs/1910.10683 Exploring the limits of transfer learning with a unified text-to-text transformer . Journal of Machine Learning Research, 21(140):1--67

work page internal anchor Pith review Pith/arXiv arXiv 2020