Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Ouput Formats

Pierre Epron (HeKA | U1346, DIG); Adrien Coulet (HeKA | U1346); Mehwish Alam (IP Paris)

arxiv: 2604.25920 · v1 · submitted 2026-03-27 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-14 23:35 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords biomedical named entity recognition · lightweight large language models · output formats · instruction tuning · biomedical information extraction · model performance · healthcare NLP

The pith

Lightweight LLMs can match larger models in biomedical named entity recognition when output formats are chosen well.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests lightweight large language models on recognizing biomedical named entities such as diseases or drugs in text. It measures how different ways of formatting the model's output affect accuracy and compares results against much larger models. The work finds that these smaller models reach competitive performance levels, especially with certain output formats, which supports their use where computing resources or data privacy rules limit access to big models. Instruction tuning across many formats brings no overall gain, yet a few formats reliably produce stronger results.

Core claim

Lightweight LLMs achieve competitive performance compared to larger models in Biomedical Named Entity Recognition, with performance varying by output format but not improved by instruction tuning over many formats.

What carries the argument

Evaluation of how different output formats affect the accuracy of lightweight LLMs on biomedical named entity recognition tasks.

If this is right

Lightweight models become practical choices for biomedical information extraction in settings with limited compute or strict privacy rules.
Certain output formats should be prioritized when prompting these models for named entity recognition.
Broad instruction tuning across many formats is not required to reach strong performance.
Healthcare applications can rely on smaller models to perform entity recognition without large infrastructure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Local deployment of these models could lower dependence on external servers and better protect patient data in clinical environments.
The best-performing formats identified here may transfer to other biomedical text tasks such as relation extraction.
Further tests could isolate whether particular model families among the lightweight group respond differently to the same formats.

Load-bearing premise

The performance differences observed are driven primarily by output format choices rather than other factors such as specific model selection or dataset details.

What would settle it

Re-running the experiments on a new biomedical dataset or with a different collection of lightweight models and checking whether the same output formats still rank highest would confirm or refute the central claim.
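One way to make this check concrete is to compare which formats land in the top positions in the original run and in a replication. The sketch below is only an illustration: the format names and scores are placeholders invented for the example, not results from the paper, and top-k overlap is one possible agreement measure among many.

```python
def rank_formats(scores):
    """Return format names sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def top_k_overlap(scores_a, scores_b, k=2):
    """Fraction of the top-k formats shared between two evaluations."""
    top_a = set(rank_formats(scores_a)[:k])
    top_b = set(rank_formats(scores_b)[:k])
    return len(top_a & top_b) / k


# Placeholder numbers for illustration only; these are NOT results from the paper.
original = {"format_a": 0.80, "format_b": 0.75, "format_c": 0.60}
replication = {"format_a": 0.78, "format_b": 0.62, "format_c": 0.70}

print(rank_formats(original))
print(rank_formats(replication))
print(top_k_overlap(original, replication, k=2))
```

An overlap near 1.0 on a new dataset or model set would support the claim that the same formats rank highest; a low overlap would count against it.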

Original abstract

Despite their strong linguistic capabilities, Large Language Models (LLMs) are computationally demanding and require substantial resources for fine-tuning, which is unadapted to privacy and budget constraints of many healthcare settings. To address this, we present an experimental analysis focused on Biomedical Named Entity Recognition using lightweight LLMs, we evaluate the impact of different output formats on model performance. The results reveal that lightweight LLMs can achieve competitive performance compared to the larger models, highlighting their potential as lightweight yet effective alternatives for biomedical information extraction. Our analysis shows that instruction tuning over many distinct formats does not improve performance, but identifies several format consistently associated with better performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Lightweight LLMs reach competitive BioNER scores with certain output formats and multi-format tuning adds little, but the larger-model comparisons need matched fine-tuning controls to hold up.

The main thing to know is that this paper tests how output format affects a handful of lightweight LLMs on biomedical named entity recognition and reports that some formats work better while training across many formats at once brings no gain. They also claim the small models can match larger ones under the right setup. That is the core empirical result they deliver.

The work is useful because it focuses on real constraints in healthcare deployments where you cannot afford big models or send data out. By running the same task across formats like JSON, XML, and plain text on several small models, they give practitioners a short list of formats worth trying first. The experiments are purely empirical and stay within standard BioNER setups, which keeps the claims grounded in observable performance numbers rather than new theory.

The soft spot is the comparison to larger models. If those were not fine-tuned with the identical instruction protocol and dataset splits, then any apparent competitiveness could come from differences in training rather than scale or format choice. The abstract leaves out dataset names, exact metrics, and statistical tests, so the strength of the numbers is hard to judge until the tables are checked.

This paper is for applied biomedical NLP people who need to run extraction locally under tight compute or privacy rules. It does not introduce new methods or resolve open questions, so it will not change how most researchers think about LLMs, but the format findings could be cited in deployment papers if the controls are clean. I would send it to peer review because the practical question is timely and the experiments are reproducible in principle, even if revisions will likely be needed to tighten the baseline comparisons and reporting.
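To make "output format" concrete, here is a minimal sketch of the same entity set rendered in the three styles the note mentions (JSON, XML, plain text). The entity list, field names, and serialization layouts are assumptions made for illustration; the paper's actual format definitions are not reproduced on this page.

```python
import json
from xml.sax.saxutils import escape, quoteattr

# Hypothetical (mention, type) pairs; not taken from the paper's datasets.
entities = [("lung tumor", "disease"), ("cisplatin", "chemical")]


def to_json(ents):
    """Serialize entities as a JSON list of objects."""
    return json.dumps([{"text": text, "type": label} for text, label in ents])


def to_xml(ents):
    """Serialize entities as XML, escaping text and attribute values."""
    items = "".join(
        f"<entity type={quoteattr(label)}>{escape(text)}</entity>"
        for text, label in ents
    )
    return f"<entities>{items}</entities>"


def to_plain(ents):
    """Serialize entities as a semicolon-separated plain-text list."""
    return "; ".join(f"{text} ({label})" for text, label in ents)


for render in (to_json, to_xml, to_plain):
    print(render(entities))
```

Each rendering is a different target string for the model to generate, which is why the choice can plausibly affect both accuracy and how easily the output can be parsed back into entities.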

Referee Report

2 major / 2 minor

Summary. The paper presents an experimental analysis of lightweight LLMs for biomedical named entity recognition, evaluating the effects of different output formats on performance. It claims that lightweight models achieve competitive results relative to larger models, that instruction tuning across many formats does not improve outcomes, and that certain formats are consistently associated with better performance.

Significance. If the empirical comparisons hold under matched conditions, the work would offer practical value for resource-limited biomedical settings by demonstrating viable lightweight alternatives and format guidelines for information extraction tasks.

major comments (2)

[Abstract] The abstract and experimental description supply no dataset names, sizes, splits, metrics (e.g., exact F1 definition), statistical tests, or baseline implementation details, preventing verification of the central claim that lightweight LLMs achieve competitive performance.
[Experimental Setup] The comparison to larger models does not confirm identical fine-tuning regimes, instruction-tuning protocols, or dataset handling for all models; without this, performance differences cannot be securely attributed to model scale or output format rather than training disparities.

minor comments (2)

[Abstract] Abstract contains a grammatical issue ('using lightweight LLMs, we evaluate') and a typo ('Ouput' for 'Output').
[Methods] Notation for output formats is introduced without a clear table or enumeration of the exact formats tested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will make to improve clarity and verifiability.

Referee: [Abstract] The abstract and experimental description supply no dataset names, sizes, splits, metrics (e.g., exact F1 definition), statistical tests, or baseline implementation details, preventing verification of the central claim that lightweight LLMs achieve competitive performance.

Authors: We agree that the abstract is too concise and omits essential details required for verification. In the revised manuscript we will expand the abstract to name the datasets (BC5CDR, NCBI Disease, JNLPBA), report their sizes and standard splits, specify the evaluation metric as micro-averaged F1, note the use of paired statistical tests for significance, and reference the shared fine-tuning protocol and libraries employed for all baselines. revision: yes
Referee: [Experimental Setup] The comparison to larger models does not confirm identical fine-tuning regimes, instruction-tuning protocols, or dataset handling for all models; without this, performance differences cannot be securely attributed to model scale or output format rather than training disparities.

Authors: We acknowledge the need for explicit confirmation. All models, including the larger ones, were trained under an identical instruction-tuning regime using the same dataset splits, learning rate, batch size, epochs, and prompt templates. We will insert a dedicated paragraph in the Experimental Setup section that tabulates these shared hyperparameters and states that only model scale and output format were varied, thereby ruling out training disparities as a confounding factor. revision: yes
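The first response names micro-averaged F1 as the evaluation metric. As a rough sketch of what that usually means for NER, the code below pools true positives, false positives, and false negatives across all documents before computing precision and recall. Exact matching on (mention, type) pairs is an assumption here; the page does not state the paper's matching criterion, and the example entities are hypothetical.

```python
from collections import Counter


def micro_f1(gold_docs, pred_docs):
    """Micro-averaged F1: pool TP/FP/FN over all documents, then compute F1."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_counts, pred_counts = Counter(gold), Counter(pred)
        # Multiset intersection counts exact (mention, type) matches.
        overlap = sum((gold_counts & pred_counts).values())
        tp += overlap
        fp += sum(pred_counts.values()) - overlap
        fn += sum(gold_counts.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted entities for two documents.
gold_docs = [
    [("lung tumor", "disease")],
    [("aspirin", "chemical"), ("headache", "disease")],
]
pred_docs = [
    [("lung tumor", "disease")],
    [("aspirin", "chemical"), ("fever", "disease")],
]

print(micro_f1(gold_docs, pred_docs))
```

Because counts are pooled before averaging, datasets or entity types with more mentions weigh more heavily in the final score than they would under macro-averaging.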

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations

This paper conducts an experimental analysis of lightweight LLMs for biomedical NER across output formats, reporting direct performance metrics from model runs. It contains no equations, derivations, fitted parameters presented as predictions, or mathematical claims that could reduce to inputs by construction. Central results rest on empirical comparisons rather than self-definitional steps, self-citation chains, or ansatzes smuggled via prior work. The study is self-contained against its own benchmarks with no load-bearing circular elements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, or invented entities; the work is an empirical performance study relying on standard assumptions of supervised evaluation in NLP.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · 3 internal anchors

[1]

In this setting, structured prediction plays a central role as it enables models to generate outputs with predefined structures

Introduction Generative Named Entity Recognition (G-NER) offers a promising paradigm shift from traditional span-based or classification-based approaches by framing entity extraction as a text generation task (Xu et al., 2024a). In this setting, structured prediction plays a central role as it enables models to generate outputs with predefined structure...

work page 2024
[2]

Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Ouput Formats

Related Work Since the rise of LLMs there has been remarkable progress in the task of IE. This section discusses most recent methods for NER based on LLMs. Others are excluded for the consistency of comparison and space limitation. See Xu et al. (Xu et al., 2024b) for a recent and detailed survey. Prompting based Approaches. Zero-shot Information Extraction (ZIE) aims...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

(Ding et al., 2024) incorporates negative instances in generative NER training on The Pile open-source corpus, improving zero-shot performance on unseen entity domains

distills ChatGPT into smaller student models through mission-focused instruction tuning for open NER, while Ding et al. (Ding et al., 2024) incorporates negative instances in generative NER training on The Pile open-source corpus, improving zero-shot performance on unseen entity domains. Supervised fine-tuning approaches further enhance model capabi...

work page 2024
[4]

lung tumor

Methodology In this work, we focus on Causal Language Models (CLMs), which generate text autoregressively by predicting each token given its preceding context. Trained with the Next Token Prediction (NTP) objective, CLMs naturally align with text generation tasks, making them well-suited for instruction tuning. We adopt this framework to treat NER as a text ge...

work page 2024
[5]

Datasets We selected eight BioNER datasets for our analysis

Experimental Setting 4.1. Datasets We selected eight BioNER datasets for our analysis. See Table 1 for detailed statistics of each dataset. AnatEM (Pyysalo and Ananiadou, 2014) contains PubMed abstracts, annotated for anatomical entities, including organs, tissues, and body parts. BioCreative II Gene Mention (BC2GM) (Smith et al., 2008) is a PubMed-based corpus ...

work page 2014
[6]

disease" when the only possible label of AnatEM is “anatomy

Results The experimental results obtained indicate two primary findings. First, it is observed that despite the discrepancy in model size between the two base models (500M for Qwen-2.5 and 1B for Llama-3.2), the performance does not significantly decline. This is supported by the fact that the same applies to the baselines considered, which are all based on ...

work page
[7]

Conclusion This study demonstrates that instruction-tuned, lightweight LLMs can achieve competitive performance in biomedical G-NER, challenging the dominance of larger-scale models. We identified the most effective output formats (formats conv_term and multi_triple) for representing biomedical entities, including complex cases such as nested and di...

work page arXiv 2023
[8]

In Findings of the Association for Computational Linguistics, ACL, pages 15992–16030

Proggen: Generating named entity recognition datasets step-by-step with self-reflexive large language models. In Findings of the Association for Computational Linguistics, ACL, pages 15992–16030. Association for Computational Linguistics. Xuming Hu, Yong Jiang, Aiwei Liu, Zhongqiang Huang, Pengjun Xie, Fei Huang, Lijie Wen, and Philip S. Yu. 2023. Entity-to...

work page arXiv 2023
[9]

LLaMA: Open and Efficient Foundation Language Models

Overview of biocreative ii gene mention recognition. Genome biology, 9(Suppl 2):S2. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. Chenguang Wang,...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Qwen2 Technical Report

Qwen2 technical report. arXiv preprint arXiv:2407.10671. Urchade Zaratiana, Nadi Tomeh, Pierre Holat, and Thierry Charnois. 2024. An autoregressive text-to-graph framework for joint entity and relation extraction. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intellig...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

OpenReview.net. A. Appendix A.1. Training hyperparameters Table 8 lists the most important hyperparameters used for training. These were chosen to align with previous works, as explained above. The complete list of hyperparameters can be found on the GitHub repository: Training hyperparameters #epoch 15 #GPU 4 train batch size 2 gradient accumulation 8 le...

work page
