pith. sign in

arxiv: 2604.06403 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.AI

FMI@SU ToxHabits: Evaluating LLMs Performance on Toxic Habit Extraction in Spanish Clinical Texts

Pith reviewed 2026-05-10 19:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords toxic habit extractionSpanish clinical textsLLM promptingnamed entity recognitionsubstance use detectionfew-shot promptingToxHabits task
0
0 comments X

The pith

GPT-4.1 few-shot prompting extracts toxic habit mentions from Spanish clinical texts at F1 0.65.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates large language models for recognizing named entities about toxic habits in Spanish clinical texts as part of the ToxHabits shared task. It targets subtask 1, which requires detecting mentions of substance use and abuse in clinical case reports and sorting them into Tobacco, Alcohol, Cannabis, or Drug categories. The authors tested zero-shot prompting, few-shot prompting, and prompt optimization. GPT-4.1 with few-shot prompting gave the strongest result, reaching an F1 score of 0.65 on the test set. The outcome points to LLMs as a workable route for named entity recognition in medical texts written in languages other than English.

Core claim

In the ToxHabits shared task subtask 1, few-shot prompting of GPT-4.1 achieved an F1 score of 0.65 when detecting substance use and abuse mentions in Spanish clinical case reports and classifying them into four categories: Tobacco, Alcohol, Cannabis, and Drug.

What carries the argument

Few-shot prompting of GPT-4.1 to identify and classify toxic habit named entities in Spanish clinical text.

If this is right

  • Spanish clinical documentation can be processed automatically to flag patient substance use patterns without language-specific retraining.
  • The 0.65 F1 result supplies a concrete baseline for LLM-based named entity recognition on non-English medical data.
  • Prompt-based methods can adapt general LLMs to specialized health categories such as substance mentions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompting setup could be tried on clinical texts in other Romance languages to test cross-lingual transfer.
  • Hospital electronic records could incorporate this approach for routine screening of toxic habits.
  • Combining the method with modest additional fine-tuning on domain data might raise the F1 score further.

Load-bearing premise

The few-shot prompting approach with GPT-4.1 generalizes beyond the specific ToxHabits test set and that the reported F1 reflects true capability rather than prompt overfitting or shared-task data characteristics.

What would settle it

Running the identical few-shot prompt on an independent collection of Spanish clinical case reports outside the ToxHabits dataset and checking whether the F1 score stays near 0.65.

Figures

Figures reproduced from arXiv: 2604.06403 by Ivan Koychev, Svetla Boytcheva, Sylvia Vassileva.

Figure 1
Figure 1. Figure 1: The process for mention extraction using an LLM. 5. Experiments and Results As a baseline approach, we created a dictionary from the train set and filtered the mentions that were unambiguously labeled as entities. Also, we trained a BERT-based model on token classification (Spanish Clinical RoBERTa), which has shown good results on Spanish clinical named entity recognition. We performed some preliminary sm… view at source ↗
read the original abstract

The paper presents an approach for the recognition of toxic habits named entities in Spanish clinical texts. The approach was developed for the ToxHabits Shared Task. Our team participated in subtask 1, which aims to detect substance use and abuse mentions in clinical case reports and classify them in four categories (Tobacco, Alcohol, Cannabis, and Drug). We explored various methods of utilizing LLMs for the task, including zero-shot, few-shot, and prompt optimization, and found that GPT-4.1's few-shot prompting performed the best in our experiments. Our method achieved an F1 score of 0.65 on the test set, demonstrating a promising result for recognizing named entities in languages other than English.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript reports the FMI@SU team's participation in subtask 1 of the ToxHabits shared task, which requires detecting and classifying mentions of toxic habits (Tobacco, Alcohol, Cannabis, Drug) as named entities in Spanish clinical case reports. The authors test zero-shot, few-shot, and prompt-optimization strategies with LLMs and state that GPT-4.1 few-shot prompting performed best, achieving an F1 score of 0.65 on the held-out test set.

Significance. If the experimental protocol and comparisons are supplied, the work would provide a useful data point on the viability of instruction-tuned LLMs for clinical NER in Spanish, a setting where labeled data are scarce. The 0.65 F1 figure is modest but could serve as a reference for future prompting or fine-tuning studies once baselines and reproducibility details are added.

major comments (3)
  1. [Abstract] Abstract: the claim that 'GPT-4.1's few-shot prompting performed the best' is unsupported because no per-configuration F1 scores, shot counts, or prompt templates are reported, nor is any comparison to non-LLM baselines (e.g., fine-tuned XLM-R or CRF) supplied on the same data split.
  2. [Results] Results section: the central performance number (F1 = 0.65) is presented without error bars, statistical significance tests, or information on the train/dev/test split sizes, making it impossible to assess whether the score reflects model capability or test-set characteristics.
  3. [Methods] Methods: the description of the three prompting regimes (zero-shot, few-shot, prompt optimization) contains no concrete prompt text, example shots, or optimization procedure, which are load-bearing for reproducing or interpreting the reported superiority of the GPT-4.1 few-shot run.
minor comments (1)
  1. [Abstract] The abstract and title could more explicitly name the shared task and subtask to improve discoverability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the valuable feedback on our manuscript. We have carefully considered each comment and will make revisions to improve the reporting of our experimental details and results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'GPT-4.1's few-shot prompting performed the best' is unsupported because no per-configuration F1 scores, shot counts, or prompt templates are reported, nor is any comparison to non-LLM baselines (e.g., fine-tuned XLM-R or CRF) supplied on the same data split.

    Authors: We agree that additional details are needed to support the claim in the abstract. In the revised version, we will report the F1 scores for the different prompting configurations tested, including the number of shots used, and summarize the prompt templates. Regarding non-LLM baselines, our study was specifically designed to evaluate LLM prompting strategies in the context of the shared task; we did not implement traditional NER models such as XLM-R or CRF. We will clarify this scope in the abstract and methods. revision: partial

  2. Referee: [Results] Results section: the central performance number (F1 = 0.65) is presented without error bars, statistical significance tests, or information on the train/dev/test split sizes, making it impossible to assess whether the score reflects model capability or test-set characteristics.

    Authors: We will update the Results section to include the sizes of the train, development, and test splits as provided by the ToxHabits shared task. As the evaluation followed the official single test set protocol without multiple random seeds or runs, error bars and statistical significance tests were not computed. We will explicitly state this in the revised manuscript to avoid misinterpretation. revision: partial

  3. Referee: [Methods] Methods: the description of the three prompting regimes (zero-shot, few-shot, prompt optimization) contains no concrete prompt text, example shots, or optimization procedure, which are load-bearing for reproducing or interpreting the reported superiority of the GPT-4.1 few-shot run.

    Authors: We will expand the Methods section to include the concrete prompt texts for each regime, specific examples of the few-shot instances used, and a detailed description of the prompt optimization procedure. This will enable full reproducibility of our experiments. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical LLM evaluation on shared task

full rationale

The manuscript is a purely empirical report of LLM prompting experiments (zero-shot, few-shot, prompt optimization) on the ToxHabits shared-task test set for Spanish clinical NER. It states that GPT-4.1 few-shot prompting yielded the highest F1 of 0.65 but contains no equations, fitted parameters, derivations, or self-citations that reduce any claim to its own inputs by construction. All reported results are direct measurements on held-out data; the work is therefore self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation paper with no mathematical derivations, free parameters, or invented entities. The central claim rests on the assumption that the shared-task test set is a valid benchmark and that LLM prompting performance can be meaningfully compared across configurations.

pith-pipeline@v0.9.0 · 5425 in / 1325 out tokens · 50739 ms · 2026-05-10T19:38:45.350885+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

  1. [1]

    Hyoun-Joong, Managing unstructured big data in healthcare system, Healthc Inform Res 25 (2019) 1–2

    K. Hyoun-Joong, Managing unstructured big data in healthcare system, Healthc Inform Res 25 (2019) 1–2. URL: http://e-hir.org/journal/view.php?number=999. doi:10.4258/hir.2019.25.1. 1.arXiv:http://e-hir.org/journal/view.php?number=999

  2. [2]

    Al-Nabki, S

    W. Al-Nabki, S. Lima-López, G. Vayá-Abad, , M. Krallinger, Overview of toxhabits at biocreative ix: corpus, guidelines and evaluation of systems for the detection of toxic habits from text, in: BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence...

  3. [3]

    Z. Lu, Y. Peng, T. Cohen, M. Ghassemi, C. Weng, S. Tian, Large language models in biomedicine and health: current research landscape and future direc- tions, Journal of the American Medical Informatics Association 31 (2024) 1801–

  4. [4]

    doi: 10.1093/jamia/ocae202

    URL: https://doi.org/10.1093/jamia/ocae202. doi: 10.1093/jamia/ocae202. arXiv:https://academic.oup.com/jamia/article-pdf/31/9/1801/58868285/ocae202.pdf

  5. [5]

    Y. Hu, Q. Chen, J. Du, X. Peng, V. K. Keloth, X. Zuo, Y. Zhou, Z. Li, X. Jiang, Z. Lu, et al., Improving large language models for clinical named entity recognition via prompt engineering, Journal of the American Medical Informatics Association 31 (2024) 1812–1820

  6. [6]

    LLMs in biomedicine: a study on clinical named entity recognition,

    M. Monajatipoor, J. Yang, J. Stremmel, M. Emami, F. Mohaghegh, M. Rouhsedaghat, K.-W. Chang, Llms in biomedicine: A study on clinical named entity recognition, arXiv preprint arXiv:2404.07376 (2024)

  7. [7]

    J. Bian, J. Zheng, Y. Zhang, S. Zhu, Inspire the large language model by external knowledge on biomedical named entity recognition, arXiv preprint arXiv:2309.12278 (2023)

  8. [8]

    García-Barragán, A

    Á. García-Barragán, A. Sakor, M.-E. Vidal, E. Menasalvas, J. C. S. Gonzalez, M. Provencio, V. Robles, Nssc: a neuro-symbolic ai system for enhancing accuracy of named entity recognition and linking from oncologic clinical notes, Medical & Biological Engineering & Computing 63 (2025) 749–772

  9. [9]

    Rohanian, M

    O. Rohanian, M. Nouriborji, S. Kouchaki, F. Nooralahzadeh, L. Clifton, D. A. Clifton, Explor- ing the effectiveness of instruction tuning in biomedical language processing, Artificial Intel- ligence in Medicine 158 (2024) 103007. URL: https://www.sciencedirect.com/science/article/pii/ S0933365724002495. doi:https://doi.org/10.1016/j.artmed.2024.103007

  10. [10]

    W. Zhou, S. Zhang, Y. Gu, M. Chen, H. Poon, Universalner: Targeted distillation from large language models for open named entity recognition (2023).arXiv:2308.03279

  11. [11]

    Q. Lu, R. Li, A. Wen, J. Wang, L. Wang, H. Liu, Large language models struggle in token-level clinical named entity recognition, in: AMIA Annual Symposium Proceedings, volume 2024, 2025, p. 748

  12. [12]

    Bodenreider, The unified medical language system (UMLS): integrating biomedical terminology, Nucleic Acids Res

    O. Bodenreider, The unified medical language system (UMLS): integrating biomedical terminology, Nucleic Acids Res. 32 (2004) D267–70

  13. [13]

    K. B. Cohen, K. Verspoor, K. Fort, C. Funk, M. Bada, M. Palmer, L. E. Hunter, The colorado richly annotated full text (craft) corpus: Multi-model annotation in the biomedical domain, Handbook of Linguistic annotation (2017) 1379–1394

  14. [14]

    Biana, W

    J. Biana, W. Zhai, X. Huang, J. Zheng, S. Zhu, Vaner: leveraging large language model for versatile and adaptive biomedical named entity recognition, arXiv preprint arXiv:2404.17835 (2024)

  15. [15]

    Lima-López, W

    S. Lima-López, W. Alnabki, G. Vayá-Abad, M. Krallinger, Toxhabits-ner: A gold-standard annotated dataset for named entity recognition in toxic habits context, 2025. URL: https://doi.org/10.5281/ zenodo.15538314. doi:10.5281/zenodo.15538314

  16. [16]

    Khattab, A

    O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, C. Potts, Dspy: Compiling declarative language model calls into self-improving pipelines, 2024

  17. [17]

    Sarmah, K

    B. Sarmah, K. Dutta, A. Grigoryan, S. Tiwari, S. Pasquali, D. Mehta, A comparative study of dspy teleprompter algorithms for aligning large language models evaluation metrics to human evaluation, 2024. URL: https://arxiv.org/abs/2412.15298.arXiv:2412.15298