pith. sign in

arxiv: 2605.18462 · v1 · pith:PQ3KF3A5new · submitted 2026-05-18 · 💻 cs.CL

From BERT to T5: A Study of Named Entity Recognition

Pith reviewed 2026-05-20 11:26 UTC · model grok-4.3

classification 💻 cs.CL
keywords named entity recognitionBERTT5few-shot promptingweighted cross-entropysequence labellingablation studyerror analysis
0
0 comments X

The pith

BERT with weighted cross-entropy and T5 with few-shot prompts show different strengths on named entity recognition under 7-class and 3-class tag schemes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to compare how an encoder-only model such as BERT and a sequence-to-sequence model such as T5 handle named entity recognition when each is adapted in a particular way. BERT receives a classification head and weighted cross-entropy loss, while T5 receives few-shot prompts and is tested with two validation strategies. Experiments run on both the original seven-class tags and a simplified three-class version, accompanied by hyperparameter ablations and error analysis. A sympathetic reader would care because the results can guide practical choices between these architectures for sequence-labelling problems in NLP pipelines.

Core claim

By fine-tuning BERT with a weighted cross-entropy loss and adapting T5 through few-shot prompts, then evaluating both under the original 7-class tag scheme and a simplified 3-class scheme, the study produces comparative performance metrics, ablation results, and error breakdowns that illuminate the relative abilities of the two architectures for sequence labelling.

What carries the argument

The BERT classifier head trained with weighted cross-entropy together with the T5 sequence-to-sequence model driven by few-shot prompts, applied to named-entity tag sequences.

If this is right

  • BERT may be the stronger default choice when the full set of entity classes must be distinguished.
  • T5 may suit settings where label sets are simplified or training examples are scarce.
  • Error patterns identified for each model can direct targeted fixes such as better boundary handling.
  • Ablation outcomes indicate which hyperparameters matter most for stable results from each architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same comparison setup could be reused on other sequence-labelling problems such as chunking or semantic role labelling.
  • Results point toward testing whether the observed trade-offs persist on domain-specific or low-resource NER datasets.
  • Hybrid systems that route examples to the stronger model for a given tag granularity might improve overall accuracy.

Load-bearing premise

The chosen fine-tuning procedures, prompt formats, and validation strategies constitute representative and fair tests of each model's capacity for the NER task.

What would settle it

Repeating the experiments with alternate prompt formats or loss weights that reverse the observed performance ordering between BERT and T5 on the 7-class scheme would undermine the reported relative abilities.

Figures

Figures reproduced from arXiv: 2605.18462 by Mei Jia.

Figure 1
Figure 1. Figure 1: Heatmaps of tag distribution of 7-class tags (left) and 3-class tags (right) across training (Train), validation [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Plots of training and validation loss (left) and validation metrics(right) over 20 epochs for the performance [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Plots of training and validation loss (left) and validation metrics(right) over 20 epochs for the performance [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Plots of training loss (left) and validation metrics(right) over 15 epochs for the performance of T5 on [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Plots of training loss (left) and validation metrics(right) over 13 epochs for the performance of T5 on [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Plot of top error examples of BERT in the Test dataset with full tags. [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Plot of top error examples of BERT in the OOD dataset with full tags. [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Plot of error examples of BERT in the Test dataset with simple tags. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Plot of top error examples of BERT in the OOD dataset with simple tags. [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
read the original abstract

Named entity recognition (NER) has been one of the essential preliminary steps in modern NLP applications. This report focuses on implementing the NER task on finetuning two pretrained models: (i) an encoder-only model (BERT) with a simple classification head, and (ii) a sequence-to-sequence model (T5) with few-shot prompts. Under the original 7-class tag and 3-class simplified tag schemes, BERT is applied a weighted cross-entropy for training loss, and T5 is fine-tuned with two validation strategies. It also conducted an ablation study with different hyperparameters. Moreover, the related analysis provides valuable insights into common errors in BERT and the two models' performance. Based on a bunch of performance metrics, this report aims to compare the above two architectures and explore their abilities in the sequence labelling task, laying the groundwork for further practical use cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript compares an encoder-only model (BERT with a token classification head and weighted cross-entropy loss) against a seq2seq model (T5 with few-shot prompting) on named entity recognition under both the original 7-class and a simplified 3-class tag scheme. It reports results from hyperparameter ablations, two validation strategies for T5, and error analysis to draw conclusions about the relative abilities of the two architectures for sequence labeling.

Significance. If the T5 prompting, decoding, and span-extraction procedures are demonstrated to be representative instantiations, the work supplies practical empirical guidance on when encoder-only versus encoder-decoder models may be preferable for NER, together with concrete error patterns that practitioners can use when selecting architectures. The inclusion of both tag-scheme variants and ablation results strengthens the utility of the comparison.

major comments (2)
  1. [T5 experimental setup] Section describing T5 experimental setup: the prompt templates, output format constraints, and procedure for parsing generated text into entity spans are not specified. Because the central claim rests on a fair head-to-head comparison of token-classification versus seq2seq capacity, the absence of these details leaves open the possibility that observed gaps are artifacts of prompt engineering or post-hoc parsing rather than architectural differences.
  2. [Results and discussion] Results and discussion sections: no quantitative comparison is provided between the two T5 validation strategies, nor is it shown whether either strategy approaches the performance obtainable by standard supervised fine-tuning of T5 on the same NER data. This information is load-bearing for the claim that the reported T5 numbers reflect seq2seq model capacity.
minor comments (2)
  1. [Abstract] The abstract states that T5 is 'fine-tuned with two validation strategies' while the title and later text refer to 'few-shot prompts'; the terminology should be made consistent throughout.
  2. [Results] Table or figure reporting main results: error bars or standard deviations across runs are not mentioned; adding them would strengthen the reliability of the reported performance differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript comparing BERT and T5 for named entity recognition. We address each major comment below and indicate the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [T5 experimental setup] Section describing T5 experimental setup: the prompt templates, output format constraints, and procedure for parsing generated text into entity spans are not specified. Because the central claim rests on a fair head-to-head comparison of token-classification versus seq2seq capacity, the absence of these details leaves open the possibility that observed gaps are artifacts of prompt engineering or post-hoc parsing rather than architectural differences.

    Authors: We agree that these implementation details are essential for reproducibility and for supporting a fair architectural comparison. The prompt templates, output format constraints, and span-parsing procedure were described at a high level but not with sufficient specificity in the original submission. In the revised manuscript we will add a dedicated subsection that provides the exact prompt templates (including few-shot examples), the required output format (e.g., structured entity-type tags), and the deterministic parsing rules used to convert generated text into BIO-style spans. Concrete examples of prompts and model outputs will also be included. revision: yes

  2. Referee: [Results and discussion] Results and discussion sections: no quantitative comparison is provided between the two T5 validation strategies, nor is it shown whether either strategy approaches the performance obtainable by standard supervised fine-tuning of T5 on the same NER data. This information is load-bearing for the claim that the reported T5 numbers reflect seq2seq model capacity.

    Authors: We acknowledge that a side-by-side quantitative comparison of the two validation strategies is missing and will add it to the results section, reporting F1 scores, precision, and recall for both strategies under the 7-class and 3-class tag sets. Our study deliberately focuses on few-shot prompting rather than full supervised fine-tuning in order to examine seq2seq behavior under limited supervision. We did not run a standard full fine-tuning baseline due to computational constraints. In the revised discussion we will explicitly state this scope limitation, clarify that the reported T5 numbers reflect the few-shot regime, and note full fine-tuning as a natural direction for future work. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparison of BERT and T5 on NER

full rationale

The paper reports an empirical study that fine-tunes BERT with weighted cross-entropy and T5 with few-shot prompts, evaluates both under 7-class and 3-class NER tag schemes, performs hyperparameter ablations, and analyzes errors. No derivation chain, first-principles prediction, or mathematical reduction is claimed or present. Performance metrics are obtained directly from held-out test data rather than being forced by construction from fitted parameters. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are referenced in the provided text. The central claims rest on observable experimental outcomes and are therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

No mathematical axioms or invented entities; the work rests on standard assumptions that pretrained transformer weights transfer to NER and that the chosen tag schemes and loss functions are appropriate.

free parameters (1)
  • hyperparameters for fine-tuning and prompting
    Learning rates, batch sizes, prompt formats, and validation strategies are selected and ablated but not enumerated in the abstract.
axioms (1)
  • domain assumption Pretrained BERT and T5 weights provide a suitable starting point for NER fine-tuning
    Implicit in the decision to fine-tune rather than train from scratch.

pith-pipeline@v0.9.0 · 5666 in / 1148 out tokens · 26166 ms · 2026-05-20T11:26:53.592337+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · 2 internal anchors

  1. [1]

    , title =

    Jurafsky, Dan and Martin, James H. , title =. Speech and Language Processing , edition =. 2021 , note =

  2. [2]

    IEEE Transactions on Knowledge and Data Engineering , volume =

    Li, Jing and Sun, Aixin and Han, Jianglei and Li, Chenliang , title =. IEEE Transactions on Knowledge and Data Engineering , volume =. 2020 , doi =

  3. [3]

    Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) , pages =

    Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , title =. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) , pages =. 2019 , url =

  4. [4]

    , title =

    Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. , title =. Journal of Machine Learning Research , volume =. 2020 , url =

  5. [5]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://arxiv.org/abs/1810.04805 BERT : Pre-training of deep bidirectional transformers for language understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 4171--4...

  6. [6]

    Dan Jurafsky and James H. Martin. 2021. Rule-based methods. In Speech and Language Processing, 3rd ed. draft edition, chapter 17.7.1. Available at: https://web.stanford.edu/ jurafsky/slp3/17.pdf

  7. [7]

    Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. 2020. https://doi.org/10.1109/TKDE.2020.2981314 A survey on deep learning for named entity recognition . IEEE Transactions on Knowledge and Data Engineering, 34(1):50--70

  8. [8]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. https://arxiv.org/abs/1910.10683 Exploring the limits of transfer learning with a unified text-to-text transformer . Journal of Machine Learning Research, 21(140):1--67