Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Ouput Formats
Pith reviewed 2026-05-14 23:35 UTC · model grok-4.3
The pith
Lightweight LLMs can match larger models in biomedical named entity recognition when output formats are chosen well.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lightweight LLMs achieve competitive performance compared to larger models in Biomedical Named Entity Recognition, with performance varying by output format but not improved by instruction tuning over many formats.
What carries the argument
Evaluation of how different output formats affect the accuracy of lightweight LLMs on biomedical named entity recognition tasks.
If this is right
- Lightweight models become practical choices for biomedical information extraction in settings with limited compute or strict privacy rules.
- Certain output formats should be prioritized when prompting these models for named entity recognition.
- Broad instruction tuning across many formats is not required to reach strong performance.
- Healthcare applications can rely on smaller models to perform entity recognition without large infrastructure.
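To make "output format" concrete, here is a minimal sketch of how the same entities might be serialized in two different target formats for a generative NER model. Both formats, the example sentence, and all function names are hypothetical illustrations; they are not the formats tested in the paper (for instance, they are not claimed to match conv_term or multi_triple).

```python
import json

# Hypothetical example sentence and gold entities (not taken from the paper's datasets).
TEXT = "Aspirin can reduce the risk of myocardial infarction."
ENTITIES = [("Aspirin", "chemical"), ("myocardial infarction", "disease")]


def to_json_list(entities):
    """Serialize entities as a JSON list of {"text", "label"} objects."""
    return json.dumps([{"text": t, "label": l} for t, l in entities])


def to_inline_tags(text, entities):
    """Wrap the first occurrence of each entity in <label>...</label> tags."""
    out = text
    for span, label in entities:
        out = out.replace(span, f"<{label}>{span}</{label}>", 1)
    return out


def from_json_list(generated):
    """Parse the JSON format back into (text, label) pairs.

    Malformed model output is treated as "no entities predicted".
    """
    try:
        return [(d["text"], d["label"]) for d in json.loads(generated)]
    except (json.JSONDecodeError, KeyError, TypeError):
        return []
```

A format that is easy to parse back (as the JSON variant is here) makes scoring simpler, but whether it also yields better model accuracy is exactly the kind of question the paper studies empirically.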
Where Pith is reading between the lines
- Local deployment of these models could lower dependence on external servers and better protect patient data in clinical environments.
- The best-performing formats identified here may transfer to other biomedical text tasks such as relation extraction.
- Further tests could isolate whether particular model families among the lightweight group respond differently to the same formats.
Load-bearing premise
The performance differences observed are driven primarily by output format choices rather than other factors such as specific model selection or dataset details.
What would settle it
Re-running the experiments on a new biomedical dataset or with a different collection of lightweight models and checking whether the same output formats still rank highest would confirm or refute the central claim.
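One way to check whether the same formats "still rank highest" is to compare the format rankings from the original and the replicated runs with a rank correlation. The sketch below uses Spearman's rho and assumes no tied scores; the format names and F1 values are made-up placeholders, not results from the paper.

```python
def rank_formats(scores):
    """Return format names sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation between two score tables.

    Both tables must cover the same formats; tied scores are not handled.
    """
    order_a = rank_formats(scores_a)
    pos_b = {name: i for i, name in enumerate(rank_formats(scores_b))}
    n = len(order_a)
    d2 = sum((i - pos_b[name]) ** 2 for i, name in enumerate(order_a))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Placeholder numbers for illustration only.
original = {"fmt_a": 0.80, "fmt_b": 0.75, "fmt_c": 0.70, "fmt_d": 0.60}
replication = {"fmt_a": 0.78, "fmt_b": 0.66, "fmt_c": 0.72, "fmt_d": 0.58}

rho = spearman_rho(original, replication)
same_top = rank_formats(original)[0] == rank_formats(replication)[0]
```

A rho close to 1 together with an unchanged top format would support the claim that format rankings are stable; a low or negative rho would suggest the rankings depend on the dataset or model selection.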
Original abstract
Despite their strong linguistic capabilities, Large Language Models (LLMs) are computationally demanding and require substantial resources for fine-tuning, which is unadapted to privacy and budget constraints of many healthcare settings. To address this, we present an experimental analysis focused on Biomedical Named Entity Recognition using lightweight LLMs, we evaluate the impact of different output formats on model performance. The results reveal that lightweight LLMs can achieve competitive performance compared to the larger models, highlighting their potential as lightweight yet effective alternatives for biomedical information extraction. Our analysis shows that instruction tuning over many distinct formats does not improve performance, but identifies several format consistently associated with better performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an experimental analysis of lightweight LLMs for biomedical named entity recognition, evaluating the effects of different output formats on performance. It claims that lightweight models achieve competitive results relative to larger models, that instruction tuning across many formats does not improve outcomes, and that certain formats are consistently associated with better performance.
Significance. If the empirical comparisons hold under matched conditions, the work would offer practical value for resource-limited biomedical settings by demonstrating viable lightweight alternatives and format guidelines for information extraction tasks.
major comments (2)
- [Abstract] The abstract and experimental description supply no dataset names, sizes, splits, metrics (e.g., exact F1 definition), statistical tests, or baseline implementation details, preventing verification of the central claim that lightweight LLMs achieve competitive performance.
- [Experimental Setup] The comparison to larger models does not confirm identical fine-tuning regimes, instruction-tuning protocols, or dataset handling for all models; without this, performance differences cannot be securely attributed to model scale or output format rather than training disparities.
minor comments (2)
- [Abstract] Abstract contains a grammatical issue ('using lightweight LLMs, we evaluate') and a typo ('Ouput' for 'Output').
- [Methods] Notation for output formats is introduced without a clear table or enumeration of the exact formats tested.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will make to improve clarity and verifiability.
Point-by-point responses
- Referee: [Abstract] The abstract and experimental description supply no dataset names, sizes, splits, metrics (e.g., exact F1 definition), statistical tests, or baseline implementation details, preventing verification of the central claim that lightweight LLMs achieve competitive performance.
  Authors: We agree that the abstract is too concise and omits essential details required for verification. In the revised manuscript we will expand the abstract to name the datasets (BC5CDR, NCBI Disease, JNLPBA), report their sizes and standard splits, specify the evaluation metric as micro-averaged F1, note the use of paired statistical tests for significance, and reference the shared fine-tuning protocol and libraries employed for all baselines.
  revision: yes
- Referee: [Experimental Setup] The comparison to larger models does not confirm identical fine-tuning regimes, instruction-tuning protocols, or dataset handling for all models; without this, performance differences cannot be securely attributed to model scale or output format rather than training disparities.
  Authors: We acknowledge the need for explicit confirmation. All models, including the larger ones, were trained under an identical instruction-tuning regime using the same dataset splits, learning rate, batch size, epochs, and prompt templates. We will insert a dedicated paragraph in the Experimental Setup section that tabulates these shared hyperparameters and states that only model scale and output format were varied, thereby ruling out training disparities as a confounding factor.
  revision: yes
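The first response names micro-averaged F1 as the evaluation metric without defining it. As a rough sketch, micro-averaging pools true positives, false positives, and false negatives over all documents before computing precision and recall. The version below assumes exact matching on (entity text, label) pairs; the paper's exact matching criterion is not stated here, so treat that as an assumption, along with the function name and the toy data.

```python
from collections import Counter


def micro_f1(gold_docs, pred_docs):
    """Micro-averaged F1 over exact (entity text, label) matches.

    Counts are pooled across all documents before precision
    and recall are computed.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_counts, pred_counts = Counter(gold), Counter(pred)
        # Multiset intersection: each gold entity can be matched at most once.
        overlap = sum((gold_counts & pred_counts).values())
        tp += overlap
        fp += sum(pred_counts.values()) - overlap
        fn += sum(gold_counts.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only.
gold = [[("aspirin", "chemical"), ("headache", "disease")], [("BRCA1", "gene")]]
pred = [[("aspirin", "chemical")], [("BRCA1", "gene"), ("cancer", "disease")]]
score = micro_f1(gold, pred)
```

In the toy data there are 2 true positives, 1 false positive, and 1 false negative in total, so precision and recall are both 2/3.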
Circularity Check
No circularity: purely empirical evaluation with no derivations
full rationale
This paper conducts an experimental analysis of lightweight LLMs for biomedical NER across output formats, reporting direct performance metrics from model runs. It contains no equations, derivations, fitted parameters presented as predictions, or mathematical claims that could reduce to inputs by construction. Central results rest on empirical comparisons rather than self-definitional steps, self-citation chains, or ansatzes smuggled via prior work. The study is self-contained against its own benchmarks with no load-bearing circular elements.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Introduction. Generative Named Entity Recognition (G-NER) offers a promising paradigm shift from traditional span-based or classification-based approaches by framing entity extraction as a text generation task (Xu et al., 2024a). In this setting, structured prediction plays a central role as it enables models to generate outputs with predefined structure...
- [2] Related Work. Since the rise of LLMs there has been remarkable progress in the task of IE. This section discusses most recent methods for NER based on LLMs. Others are excluded for the consistency of comparison and space limitation. See Xu et al. (Xu et al., 2024b) for a recent and detailed survey. Prompting based Approaches. Zero-shot Information Extraction (ZIE) aims...
- [3] distills ChatGPT into smaller student models through mission-focused instruction tuning for open NER, while Ding et al. (Ding et al., 2024) incorporates negative instances in generative NER training on The Pile open-source corpus, improving zero-shot performance on unseen entity domains. Supervised fine-tuning approaches further enhance model capabi...
- [4] Methodology. In this work, we focus on Causal Language Models (CLMs), which generate text autoregressively by predicting each token given its preceding context. Trained with the Next Token Prediction (NTP) objective, CLMs naturally align with text generation tasks, making them well-suited for instruction tuning. We adopt this framework to treat NER as a text ge...
- [5] Experimental Setting. 4.1. Datasets. We selected eight BioNER datasets for our analysis. See Table 1 for detailed statistics of each dataset. AnatEM (Pyysalo and Ananiadou, 2014) contains PubMed abstracts, annotated for anatomical entities, including organs, tissues, and body parts. BioCreative II Gene Mention (BC2GM) (Smith et al., 2008) is a PubMed-based corpus ...
- [6] disease" when the only possible label of AnatEM is “anatomy
  Results. The experimental results obtained indicate two primary findings. First, it is observed that despite the discrepancy in model size between the two base models (500M for Qwen-2.5 and 1B for Llama-3.2), the performance does not significantly decline. This is supported by the fact that the same applies to the baselines considered, which are all based on ...
- [7] Conclusion. This study demonstrates that instruction-tuned, lightweight LLMs can achieve competitive performance in biomedical G-NER, challenging the dominance of larger-scale models. We identified the most effective output formats (formats conv_term and multi_triple) for representing biomedical entities, including complex cases such as nested and di...
- [8] Proggen: Generating named entity recognition datasets step-by-step with self-reflexive large language models. In Findings of the Association for Computational Linguistics, ACL, pages 15992–16030. Association for Computational Linguistics. Xuming Hu, Yong Jiang, Aiwei Liu, Zhongqiang Huang, Pengjun Xie, Fei Huang, Lijie Wen, and Philip S. Yu. 2023. Entity-to...
- [9] LLaMA: Open and Efficient Foundation Language Models
  Overview of biocreative ii gene mention recognition. Genome biology, 9(Suppl 2):S2. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. Chenguang Wang,...
- [10] Qwen2 technical report. arXiv preprint arXiv:2407.10671. Urchade Zaratiana, Nadi Tomeh, Pierre Holat, and Thierry Charnois. 2024. An autoregressive text-to-graph framework for joint entity and relation extraction. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intellig...
- [11] OpenReview.net. A. Appendix. A.1. Training hyperparameters. Table 8 lists the most important hyperparameters used for training. These were chosen to align with previous works, as explained above. The complete list of hyperparameters can be found on the GitHub repository. Training hyperparameters: #epoch 15; #GPU 4; train batch size 2; gradient accumulation 8; le...
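The hyperparameters listed in the appendix excerpt imply an effective batch size, which the following sketch works out. It assumes that "train batch size" is per GPU and that the 4 GPUs run in data-parallel mode; neither is confirmed by the excerpt. The dataset size used in the step estimate is a hypothetical placeholder, and real trainers may round the last partial batch differently.

```python
import math

# Values as listed in the appendix excerpt.
NUM_GPUS = 4
PER_DEVICE_BATCH_SIZE = 2  # assumption: "train batch size" is per GPU
GRAD_ACCUM_STEPS = 8
EPOCHS = 15


def effective_batch_size(num_gpus, per_device_batch, grad_accum):
    """Number of examples contributing to one optimizer update."""
    return num_gpus * per_device_batch * grad_accum


def approx_optimizer_steps(num_examples, epochs, eff_batch):
    """Approximate total updates, counting a final partial batch per epoch."""
    return epochs * math.ceil(num_examples / eff_batch)


eff = effective_batch_size(NUM_GPUS, PER_DEVICE_BATCH_SIZE, GRAD_ACCUM_STEPS)

# Hypothetical training-set size, for illustration only.
steps = approx_optimizer_steps(1000, EPOCHS, eff)
```

Under these assumptions each optimizer update sees 4 × 2 × 8 = 64 examples, which is the figure to match when reproducing the runs on a different number of GPUs.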