FMI@SU ToxHabits: Evaluating LLMs Performance on Toxic Habit Extraction in Spanish Clinical Texts

Ivan Koychev; Svetla Boytcheva; Sylvia Vassileva

arXiv: 2604.06403 · v1 · submitted 2026-04-07 · cs.CL · cs.AI

Pith reviewed 2026-05-10 19:38 UTC · model grok-4.3

keywords toxic habit extraction · Spanish clinical texts · LLM prompting · named entity recognition · substance use detection · few-shot prompting · ToxHabits task

The pith

GPT-4.1 with few-shot prompting extracts toxic habit mentions from Spanish clinical texts with an F1 score of 0.65 on the ToxHabits test set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates large language models for recognizing named entities about toxic habits in Spanish clinical texts as part of the ToxHabits shared task. It targets subtask 1, which requires detecting mentions of substance use and abuse in clinical case reports and sorting them into Tobacco, Alcohol, Cannabis, or Drug categories. The authors tested zero-shot prompting, few-shot prompting, and prompt optimization. GPT-4.1 with few-shot prompting gave the strongest result, reaching an F1 score of 0.65 on the test set. The outcome points to LLMs as a workable route for named entity recognition in medical texts written in languages other than English.

Core claim

In the ToxHabits shared task subtask 1, few-shot prompting of GPT-4.1 achieved an F1 score of 0.65 when detecting substance use and abuse mentions in Spanish clinical case reports and classifying them into four categories: Tobacco, Alcohol, Cannabis, and Drug.

What carries the argument

Few-shot prompting of GPT-4.1 to identify and classify toxic habit named entities in Spanish clinical text.
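
As an illustration of what few-shot prompting for this task involves, the sketch below assembles a prompt from annotated demonstrations and maps a JSON answer back to character offsets. The paper's actual prompt text, shot count, and output format are not given here, so the instruction wording, the JSON schema, and the helper names `build_few_shot_prompt` and `parse_mentions` are all assumptions; the model call itself is left out.

```python
import json

LABELS = ("Tobacco", "Alcohol", "Cannabis", "Drug")


def build_few_shot_prompt(examples, text):
    """Assemble a few-shot prompt from (text, mentions) demonstration pairs.

    The instruction wording is illustrative, not the authors' prompt.
    """
    lines = [
        "Extract every mention of substance use from the Spanish clinical text.",
        "Label each mention as one of: " + ", ".join(LABELS) + ".",
        'Answer with a JSON list of {"mention": ..., "label": ...} objects.',
        "",
    ]
    for demo_text, demo_mentions in examples:
        lines.append("Text: " + demo_text)
        lines.append("Mentions: " + json.dumps(demo_mentions, ensure_ascii=False))
        lines.append("")
    lines.append("Text: " + text)
    lines.append("Mentions:")
    return "\n".join(lines)


def parse_mentions(response, text):
    """Turn the model's JSON answer into (start, end, label) spans."""
    try:
        items = json.loads(response)
    except json.JSONDecodeError:
        return []  # malformed output yields no predictions
    if not isinstance(items, list):
        return []
    spans = []
    for item in items:
        if not isinstance(item, dict):
            continue
        mention, label = item.get("mention"), item.get("label")
        if label not in LABELS or not isinstance(mention, str) or not mention:
            continue
        start = text.find(mention)
        if start != -1:  # keep only mentions that occur verbatim in the text
            spans.append((start, start + len(mention), label))
    return spans
```

Locating each mention with `str.find` returns only its first occurrence, so a mention that repeats within one report would need extra bookkeeping in a real system.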

If this is right

Spanish clinical documentation can be processed automatically to flag patient substance use patterns without language-specific retraining.
The 0.65 F1 result supplies a concrete baseline for LLM-based named entity recognition on non-English medical data.
Prompt-based methods can adapt general LLMs to specialized health categories such as substance mentions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompting setup could be tried on clinical texts in other Romance languages to test cross-lingual transfer.
Hospital electronic records could incorporate this approach for routine screening of toxic habits.
Combining the method with modest additional fine-tuning on domain data might raise the F1 score further.

Load-bearing premise

That the few-shot prompting approach with GPT-4.1 generalizes beyond the specific ToxHabits test set, and that the reported F1 reflects true capability rather than prompt overfitting or shared-task data characteristics.

What would settle it

Running the identical few-shot prompt on an independent collection of Spanish clinical case reports outside the ToxHabits dataset and checking whether the F1 score stays near 0.65.
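
The check described above comes down to scoring predicted spans against gold annotations on the new collection. A minimal sketch, assuming strict matching (same document, same offsets, same label) and micro-averaging; whether this matches the official ToxHabits scorer is an assumption, and `span_f1` is a hypothetical helper.

```python
def span_f1(gold, pred):
    """Micro-averaged strict-match precision, recall and F1.

    gold, pred: iterables of (doc_id, start, end, label) tuples.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # spans that agree on document, offsets and label
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

Under strict matching, a prediction with the right offsets but the wrong category counts as both a false positive and a false negative.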

Figures

Figures reproduced from arXiv: 2604.06403 by Ivan Koychev, Svetla Boytcheva, Sylvia Vassileva.

**Figure 1.** The process for mention extraction using an LLM.

From Section 5, Experiments and Results: As a baseline approach, we created a dictionary from the train set and filtered the mentions that were unambiguously labeled as entities. Also, we trained a BERT-based model on token classification (Spanish Clinical RoBERTa), which has shown good results on Spanish clinical named entity recognition. We performed some preliminary sm…
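
The dictionary baseline mentioned in the excerpt could look roughly like the following. This is one reading of "unambiguously labeled" (a surface form is kept only if it always carries the same label in the training annotations); the authors' actual filtering rule, casing, and matching strategy are not stated, and both function names are hypothetical.

```python
import re
from collections import defaultdict


def build_dictionary(train_mentions):
    """Keep lower-cased surface forms that carry exactly one label in training."""
    labels_by_form = defaultdict(set)
    for surface, label in train_mentions:
        labels_by_form[surface.lower()].add(label)
    return {
        form: next(iter(labels))
        for form, labels in labels_by_form.items()
        if len(labels) == 1
    }


def tag_with_dictionary(text, dictionary):
    """Return non-overlapping (start, end, label) spans, longest entries first."""
    spans, taken = [], []
    for form in sorted(dictionary, key=len, reverse=True):
        # Match whole words only, ignoring case.
        pattern = r"(?<!\w)" + re.escape(form) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            if all(m.end() <= s or m.start() >= e for s, e in taken):
                taken.append(m.span())
                spans.append((m.start(), m.end(), dictionary[form]))
    return sorted(spans)
```

Trying longer entries first keeps a multi-word mention from being split by a shorter entry that it contains.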

Original abstract

The paper presents an approach for the recognition of toxic habits named entities in Spanish clinical texts. The approach was developed for the ToxHabits Shared Task. Our team participated in subtask 1, which aims to detect substance use and abuse mentions in clinical case reports and classify them in four categories (Tobacco, Alcohol, Cannabis, and Drug). We explored various methods of utilizing LLMs for the task, including zero-shot, few-shot, and prompt optimization, and found that GPT-4.1's few-shot prompting performed the best in our experiments. Our method achieved an F1 score of 0.65 on the test set, demonstrating a promising result for recognizing named entities in languages other than English.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Paper reports 0.65 F1 with GPT-4.1 few-shot on ToxHabits Spanish subtask but lacks baselines, prompts, and comparisons.

This is a straightforward shared-task participation report on extracting tobacco, alcohol, cannabis, and drug mentions from Spanish clinical texts using LLMs. They tested zero-shot, few-shot, and prompt optimization, with GPT-4.1 few-shot coming out on top at 0.65 F1 on the held-out test set. The work is new only in the narrow sense of applying these standard techniques to the ToxHabits Spanish data; the methods themselves are not novel. What it does well is deliver a clean, honest data point for non-English clinical NER, which is still relatively sparse. The authors clearly ran the experiments and reported the outcome without overclaiming. The soft spots are exactly where the stress-test note flags them: no prompt templates or shot counts are shown, no per-configuration scores appear, and there are no non-LLM baselines such as fine-tuned XLM-R or CRF on the same split. Without those, 0.65 cannot be judged as strong or weak. Error analysis by entity type is also absent. This paper is useful for anyone tracking clinical NER shared tasks or needing a quick Spanish LLM reference. It is not a methodological advance. I would accept it for peer review so referees can request the missing method details and at least one traditional baseline; the core experiment is sound enough to be worth that effort.
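
The per-entity-type error analysis the letter asks for can start from a per-label breakdown of strict-match counts. A sketch under the assumption that spans are `(doc_id, start, end, label)` tuples; `per_label_scores` is a hypothetical helper, not part of the shared-task tooling.

```python
from collections import Counter


def per_label_scores(gold, pred):
    """Strict-match precision, recall and F1 for each entity label."""
    gold, pred = set(gold), set(pred)
    tp = Counter(span[-1] for span in gold & pred)
    n_gold = Counter(span[-1] for span in gold)
    n_pred = Counter(span[-1] for span in pred)
    scores = {}
    for label in sorted(set(n_gold) | set(n_pred)):
        p = tp[label] / n_pred[label] if n_pred[label] else 0.0
        r = tp[label] / n_gold[label] if n_gold[label] else 0.0
        f1 = 2 * p * r / (p + r) if tp[label] else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores
```

A label with low precision but reasonable recall would point to over-prediction of that category, which a single aggregate F1 cannot show.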

Referee Report

3 major / 1 minor

Summary. The manuscript reports the FMI@SU team's participation in subtask 1 of the ToxHabits shared task, which requires detecting and classifying mentions of toxic habits (Tobacco, Alcohol, Cannabis, Drug) as named entities in Spanish clinical case reports. The authors test zero-shot, few-shot, and prompt-optimization strategies with LLMs and state that GPT-4.1 few-shot prompting performed best, achieving an F1 score of 0.65 on the held-out test set.

Significance. If the experimental protocol and comparisons are supplied, the work would provide a useful data point on the viability of instruction-tuned LLMs for clinical NER in Spanish, a setting where labeled data are scarce. The 0.65 F1 figure is modest but could serve as a reference for future prompting or fine-tuning studies once baselines and reproducibility details are added.

major comments (3)

[Abstract] Abstract: the claim that 'GPT-4.1's few-shot prompting performed the best' is unsupported because no per-configuration F1 scores, shot counts, or prompt templates are reported, nor is any comparison to non-LLM baselines (e.g., fine-tuned XLM-R or CRF) supplied on the same data split.
[Results] Results section: the central performance number (F1 = 0.65) is presented without error bars, statistical significance tests, or information on the train/dev/test split sizes, making it impossible to assess whether the score reflects model capability or test-set characteristics.
[Methods] Methods: the description of the three prompting regimes (zero-shot, few-shot, prompt optimization) contains no concrete prompt text, example shots, or optimization procedure, which are load-bearing for reproducing or interpreting the reported superiority of the GPT-4.1 few-shot run.
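
On the second comment, one way to attach an uncertainty estimate to a single test-set score without rerunning the model is a percentile bootstrap over documents. The sketch below is illustrative only: the function name, the resample count, and the choice of resampling whole documents are assumptions, not something the paper or the shared task prescribes.

```python
import random


def bootstrap_f1_interval(gold_by_doc, pred_by_doc, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for micro F1, resampling whole documents.

    gold_by_doc must list every evaluated document, with an empty list
    for documents that contain no gold spans.
    """
    rng = random.Random(seed)
    doc_ids = sorted(gold_by_doc)
    # Per-document counts: (true positives, predicted spans, gold spans).
    counts = {}
    for doc_id in doc_ids:
        gold = set(gold_by_doc[doc_id])
        pred = set(pred_by_doc.get(doc_id, ()))
        counts[doc_id] = (len(gold & pred), len(pred), len(gold))
    scores = []
    for _ in range(n_resamples):
        sample = [counts[rng.choice(doc_ids)] for _ in doc_ids]
        tp = sum(c[0] for c in sample)
        total = sum(c[1] + c[2] for c in sample)
        scores.append(2 * tp / total if total else 0.0)  # F1 = 2TP / (|pred| + |gold|)
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

This interval reflects sampling variation in the test documents only; it says nothing about run-to-run variation in the model's outputs, which would require repeated runs.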

minor comments (1)

[Abstract] The abstract and title could more explicitly name the shared task and subtask to improve discoverability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the valuable feedback on our manuscript. We have carefully considered each comment and will make revisions to improve the reporting of our experimental details and results.

Referee: [Abstract] Abstract: the claim that 'GPT-4.1's few-shot prompting performed the best' is unsupported because no per-configuration F1 scores, shot counts, or prompt templates are reported, nor is any comparison to non-LLM baselines (e.g., fine-tuned XLM-R or CRF) supplied on the same data split.

Authors: We agree that additional details are needed to support the claim in the abstract. In the revised version, we will report the F1 scores for the different prompting configurations tested, including the number of shots used, and summarize the prompt templates. Regarding non-LLM baselines, our study was specifically designed to evaluate LLM prompting strategies in the context of the shared task; we did not implement traditional NER models such as XLM-R or CRF. We will clarify this scope in the abstract and methods. revision: partial
Referee: [Results] Results section: the central performance number (F1 = 0.65) is presented without error bars, statistical significance tests, or information on the train/dev/test split sizes, making it impossible to assess whether the score reflects model capability or test-set characteristics.

Authors: We will update the Results section to include the sizes of the train, development, and test splits as provided by the ToxHabits shared task. As the evaluation followed the official single test set protocol without multiple random seeds or runs, error bars and statistical significance tests were not computed. We will explicitly state this in the revised manuscript to avoid misinterpretation. revision: partial
Referee: [Methods] Methods: the description of the three prompting regimes (zero-shot, few-shot, prompt optimization) contains no concrete prompt text, example shots, or optimization procedure, which are load-bearing for reproducing or interpreting the reported superiority of the GPT-4.1 few-shot run.

Authors: We will expand the Methods section to include the concrete prompt texts for each regime, specific examples of the few-shot instances used, and a detailed description of the prompt optimization procedure. This will enable full reproducibility of our experiments. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical LLM evaluation on shared task

The manuscript is a purely empirical report of LLM prompting experiments (zero-shot, few-shot, prompt optimization) on the ToxHabits shared-task test set for Spanish clinical NER. It states that GPT-4.1 few-shot prompting yielded the highest F1 of 0.65 but contains no equations, fitted parameters, derivations, or self-citations that reduce any claim to its own inputs by construction. All reported results are direct measurements on held-out data; the work is therefore self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation paper with no mathematical derivations, free parameters, or invented entities. The central claim rests on the assumption that the shared-task test set is a valid benchmark and that LLM prompting performance can be meaningfully compared across configurations.

pith-pipeline@v0.9.0 · 5425 in / 1325 out tokens · 50739 ms · 2026-05-10T19:38:45.350885+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] K. Hyoun-Joong, Managing unstructured big data in healthcare system, Healthc Inform Res 25 (2019) 1–2. URL: http://e-hir.org/journal/view.php?number=999. doi:10.4258/hir.2019.25.1.1

[2] W. Al-Nabki, S. Lima-López, G. Vayá-Abad, M. Krallinger, Overview of ToxHabits at BioCreative IX: corpus, guidelines and evaluation of systems for the detection of toxic habits from text, in: BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence... (2025)

[3]–[4] Z. Lu, Y. Peng, T. Cohen, M. Ghassemi, C. Weng, S. Tian, Large language models in biomedicine and health: current research landscape and future directions, Journal of the American Medical Informatics Association 31 (2024) 1801–. URL: https://doi.org/10.1093/jamia/ocae202. doi:10.1093/jamia/ocae202. PDF: https://academic.oup.com/jamia/article-pdf/31/9/1801/58868285/ocae202.pdf

[5] Y. Hu, Q. Chen, J. Du, X. Peng, V. K. Keloth, X. Zuo, Y. Zhou, Z. Li, X. Jiang, Z. Lu, et al., Improving large language models for clinical named entity recognition via prompt engineering, Journal of the American Medical Informatics Association 31 (2024) 1812–1820

[6] M. Monajatipoor, J. Yang, J. Stremmel, M. Emami, F. Mohaghegh, M. Rouhsedaghat, K.-W. Chang, LLMs in biomedicine: A study on clinical named entity recognition, arXiv preprint arXiv:2404.07376 (2024)

[7] J. Bian, J. Zheng, Y. Zhang, S. Zhu, Inspire the large language model by external knowledge on biomedical named entity recognition, arXiv preprint arXiv:2309.12278 (2023)

[8] Á. García-Barragán, A. Sakor, M.-E. Vidal, E. Menasalvas, J. C. S. Gonzalez, M. Provencio, V. Robles, NSSC: a neuro-symbolic AI system for enhancing accuracy of named entity recognition and linking from oncologic clinical notes, Medical & Biological Engineering & Computing 63 (2025) 749–772

[9] O. Rohanian, M. Nouriborji, S. Kouchaki, F. Nooralahzadeh, L. Clifton, D. A. Clifton, Exploring the effectiveness of instruction tuning in biomedical language processing, Artificial Intelligence in Medicine 158 (2024) 103007. URL: https://www.sciencedirect.com/science/article/pii/S0933365724002495. doi:10.1016/j.artmed.2024.103007

[10] W. Zhou, S. Zhang, Y. Gu, M. Chen, H. Poon, UniversalNER: Targeted distillation from large language models for open named entity recognition (2023). arXiv:2308.03279

[11] Q. Lu, R. Li, A. Wen, J. Wang, L. Wang, H. Liu, Large language models struggle in token-level clinical named entity recognition, in: AMIA Annual Symposium Proceedings, volume 2024, 2025, p. 748

[12] O. Bodenreider, The unified medical language system (UMLS): integrating biomedical terminology, Nucleic Acids Res. 32 (2004) D267–70

[13] K. B. Cohen, K. Verspoor, K. Fort, C. Funk, M. Bada, M. Palmer, L. E. Hunter, The Colorado richly annotated full text (CRAFT) corpus: Multi-model annotation in the biomedical domain, Handbook of Linguistic Annotation (2017) 1379–1394

[14] J. Biana, W. Zhai, X. Huang, J. Zheng, S. Zhu, VANER: leveraging large language model for versatile and adaptive biomedical named entity recognition, arXiv preprint arXiv:2404.17835 (2024)

[15] S. Lima-López, W. Alnabki, G. Vayá-Abad, M. Krallinger, ToxHabits-NER: A gold-standard annotated dataset for named entity recognition in toxic habits context, 2025. URL: https://doi.org/10.5281/zenodo.15538314. doi:10.5281/zenodo.15538314

[16] O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, C. Potts, DSPy: Compiling declarative language model calls into self-improving pipelines, 2024

[17] B. Sarmah, K. Dutta, A. Grigoryan, S. Tiwari, S. Pasquali, D. Mehta, A comparative study of DSPy teleprompter algorithms for aligning large language models evaluation metrics to human evaluation, 2024. URL: https://arxiv.org/abs/2412.15298. arXiv:2412.15298
