FMI@SU ToxHabits: Evaluating LLMs Performance on Toxic Habit Extraction in Spanish Clinical Texts
Pith reviewed 2026-05-10 19:38 UTC · model grok-4.3
The pith
GPT-4.1 few-shot prompting extracts toxic habit mentions from Spanish clinical texts at F1 0.65.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the ToxHabits shared task subtask 1, few-shot prompting of GPT-4.1 achieved an F1 score of 0.65 when detecting substance use and abuse mentions in Spanish clinical case reports and classifying them into four categories: Tobacco, Alcohol, Cannabis, and Drug.
What carries the argument
Few-shot prompting of GPT-4.1 to identify and classify toxic habit named entities in Spanish clinical text.
If this is right
- Spanish clinical documentation can be processed automatically to flag patient substance use patterns without language-specific retraining.
- The 0.65 F1 result supplies a concrete baseline for LLM-based named entity recognition on non-English medical data.
- Prompt-based methods can adapt general LLMs to specialized health categories such as substance mentions.
Where Pith is reading between the lines
- The same prompting setup could be tried on clinical texts in other Romance languages to test cross-lingual transfer.
- Hospital electronic records could incorporate this approach for routine screening of toxic habits.
- Combining the method with modest additional fine-tuning on domain data might raise the F1 score further.
Load-bearing premise
The few-shot prompting approach with GPT-4.1 generalizes beyond the specific ToxHabits test set and that the reported F1 reflects true capability rather than prompt overfitting or shared-task data characteristics.
What would settle it
Running the identical few-shot prompt on an independent collection of Spanish clinical case reports outside the ToxHabits dataset and checking whether the F1 score stays near 0.65.
Figures
read the original abstract
The paper presents an approach for the recognition of toxic habits named entities in Spanish clinical texts. The approach was developed for the ToxHabits Shared Task. Our team participated in subtask 1, which aims to detect substance use and abuse mentions in clinical case reports and classify them in four categories (Tobacco, Alcohol, Cannabis, and Drug). We explored various methods of utilizing LLMs for the task, including zero-shot, few-shot, and prompt optimization, and found that GPT-4.1's few-shot prompting performed the best in our experiments. Our method achieved an F1 score of 0.65 on the test set, demonstrating a promising result for recognizing named entities in languages other than English.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports the FMI@SU team's participation in subtask 1 of the ToxHabits shared task, which requires detecting and classifying mentions of toxic habits (Tobacco, Alcohol, Cannabis, Drug) as named entities in Spanish clinical case reports. The authors test zero-shot, few-shot, and prompt-optimization strategies with LLMs and state that GPT-4.1 few-shot prompting performed best, achieving an F1 score of 0.65 on the held-out test set.
Significance. If the experimental protocol and comparisons are supplied, the work would provide a useful data point on the viability of instruction-tuned LLMs for clinical NER in Spanish, a setting where labeled data are scarce. The 0.65 F1 figure is modest but could serve as a reference for future prompting or fine-tuning studies once baselines and reproducibility details are added.
major comments (3)
- [Abstract] Abstract: the claim that 'GPT-4.1's few-shot prompting performed the best' is unsupported because no per-configuration F1 scores, shot counts, or prompt templates are reported, nor is any comparison to non-LLM baselines (e.g., fine-tuned XLM-R or CRF) supplied on the same data split.
- [Results] Results section: the central performance number (F1 = 0.65) is presented without error bars, statistical significance tests, or information on the train/dev/test split sizes, making it impossible to assess whether the score reflects model capability or test-set characteristics.
- [Methods] Methods: the description of the three prompting regimes (zero-shot, few-shot, prompt optimization) contains no concrete prompt text, example shots, or optimization procedure, which are load-bearing for reproducing or interpreting the reported superiority of the GPT-4.1 few-shot run.
minor comments (1)
- [Abstract] The abstract and title could more explicitly name the shared task and subtask to improve discoverability.
Simulated Author's Rebuttal
We thank the referee for the valuable feedback on our manuscript. We have carefully considered each comment and will make revisions to improve the reporting of our experimental details and results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'GPT-4.1's few-shot prompting performed the best' is unsupported because no per-configuration F1 scores, shot counts, or prompt templates are reported, nor is any comparison to non-LLM baselines (e.g., fine-tuned XLM-R or CRF) supplied on the same data split.
Authors: We agree that additional details are needed to support the claim in the abstract. In the revised version, we will report the F1 scores for the different prompting configurations tested, including the number of shots used, and summarize the prompt templates. Regarding non-LLM baselines, our study was specifically designed to evaluate LLM prompting strategies in the context of the shared task; we did not implement traditional NER models such as XLM-R or CRF. We will clarify this scope in the abstract and methods. revision: partial
-
Referee: [Results] Results section: the central performance number (F1 = 0.65) is presented without error bars, statistical significance tests, or information on the train/dev/test split sizes, making it impossible to assess whether the score reflects model capability or test-set characteristics.
Authors: We will update the Results section to include the sizes of the train, development, and test splits as provided by the ToxHabits shared task. As the evaluation followed the official single test set protocol without multiple random seeds or runs, error bars and statistical significance tests were not computed. We will explicitly state this in the revised manuscript to avoid misinterpretation. revision: partial
-
Referee: [Methods] Methods: the description of the three prompting regimes (zero-shot, few-shot, prompt optimization) contains no concrete prompt text, example shots, or optimization procedure, which are load-bearing for reproducing or interpreting the reported superiority of the GPT-4.1 few-shot run.
Authors: We will expand the Methods section to include the concrete prompt texts for each regime, specific examples of the few-shot instances used, and a detailed description of the prompt optimization procedure. This will enable full reproducibility of our experiments. revision: yes
Circularity Check
No circularity in empirical LLM evaluation on shared task
full rationale
The manuscript is a purely empirical report of LLM prompting experiments (zero-shot, few-shot, prompt optimization) on the ToxHabits shared-task test set for Spanish clinical NER. It states that GPT-4.1 few-shot prompting yielded the highest F1 of 0.65 but contains no equations, fitted parameters, derivations, or self-citations that reduce any claim to its own inputs by construction. All reported results are direct measurements on held-out data; the work is therefore self-contained with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Hyoun-Joong, Managing unstructured big data in healthcare system, Healthc Inform Res 25 (2019) 1–2
K. Hyoun-Joong, Managing unstructured big data in healthcare system, Healthc Inform Res 25 (2019) 1–2. URL: http://e-hir.org/journal/view.php?number=999. doi:10.4258/hir.2019.25.1. 1.arXiv:http://e-hir.org/journal/view.php?number=999
-
[2]
W. Al-Nabki, S. Lima-López, G. Vayá-Abad, , M. Krallinger, Overview of toxhabits at biocreative ix: corpus, guidelines and evaluation of systems for the detection of toxic habits from text, in: BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence...
work page 2025
-
[3]
Z. Lu, Y. Peng, T. Cohen, M. Ghassemi, C. Weng, S. Tian, Large language models in biomedicine and health: current research landscape and future direc- tions, Journal of the American Medical Informatics Association 31 (2024) 1801–
work page 2024
-
[4]
URL: https://doi.org/10.1093/jamia/ocae202. doi: 10.1093/jamia/ocae202. arXiv:https://academic.oup.com/jamia/article-pdf/31/9/1801/58868285/ocae202.pdf
-
[5]
Y. Hu, Q. Chen, J. Du, X. Peng, V. K. Keloth, X. Zuo, Y. Zhou, Z. Li, X. Jiang, Z. Lu, et al., Improving large language models for clinical named entity recognition via prompt engineering, Journal of the American Medical Informatics Association 31 (2024) 1812–1820
work page 2024
-
[6]
LLMs in biomedicine: a study on clinical named entity recognition,
M. Monajatipoor, J. Yang, J. Stremmel, M. Emami, F. Mohaghegh, M. Rouhsedaghat, K.-W. Chang, Llms in biomedicine: A study on clinical named entity recognition, arXiv preprint arXiv:2404.07376 (2024)
- [7]
-
[8]
Á. García-Barragán, A. Sakor, M.-E. Vidal, E. Menasalvas, J. C. S. Gonzalez, M. Provencio, V. Robles, Nssc: a neuro-symbolic ai system for enhancing accuracy of named entity recognition and linking from oncologic clinical notes, Medical & Biological Engineering & Computing 63 (2025) 749–772
work page 2025
-
[9]
O. Rohanian, M. Nouriborji, S. Kouchaki, F. Nooralahzadeh, L. Clifton, D. A. Clifton, Explor- ing the effectiveness of instruction tuning in biomedical language processing, Artificial Intel- ligence in Medicine 158 (2024) 103007. URL: https://www.sciencedirect.com/science/article/pii/ S0933365724002495. doi:https://doi.org/10.1016/j.artmed.2024.103007
- [10]
-
[11]
Q. Lu, R. Li, A. Wen, J. Wang, L. Wang, H. Liu, Large language models struggle in token-level clinical named entity recognition, in: AMIA Annual Symposium Proceedings, volume 2024, 2025, p. 748
work page 2024
-
[12]
O. Bodenreider, The unified medical language system (UMLS): integrating biomedical terminology, Nucleic Acids Res. 32 (2004) D267–70
work page 2004
-
[13]
K. B. Cohen, K. Verspoor, K. Fort, C. Funk, M. Bada, M. Palmer, L. E. Hunter, The colorado richly annotated full text (craft) corpus: Multi-model annotation in the biomedical domain, Handbook of Linguistic annotation (2017) 1379–1394
work page 2017
- [14]
-
[15]
S. Lima-López, W. Alnabki, G. Vayá-Abad, M. Krallinger, Toxhabits-ner: A gold-standard annotated dataset for named entity recognition in toxic habits context, 2025. URL: https://doi.org/10.5281/ zenodo.15538314. doi:10.5281/zenodo.15538314
-
[16]
O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, C. Potts, Dspy: Compiling declarative language model calls into self-improving pipelines, 2024
work page 2024
- [17]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.