MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification

Alice Schiavone; Desmond Elliott; Lea Marie Pehrson; Marco Fraccaro; Melanie Ganz; Michael Bachmann Nielsen; Rasmus Bonnevie; Silvia Ingala; Vincent Beliveau

arxiv: 2509.04471 · v2 · pith:ZHE66ECFnew · submitted 2025-08-29 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-21 22:38 UTC · model grok-4.3


keywords: radiology report classification · multilingual medical NLP · few-shot prompting · lightweight fine-tuning · open language models · clinical text analysis · chest X-ray reports · taxonomy-agnostic methods


The pith

MOSAIC classifies radiology reports across multiple languages and label taxonomies using a compact open model that approaches or exceeds expert-level performance while fitting in 24 GB of GPU memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MOSAIC to classify the clinical details contained in radiology reports even when those reports come from different languages or use different category systems. It relies on a small publicly available language model that supports both quick prompting with few or no examples and light adjustment on new data, all while running on ordinary graphics hardware. Evaluations across seven datasets show strong results, including a mean macro F1 score of 88 on five chest X-ray collections, with performance holding up when only 80 labeled examples are available for a new language. A sympathetic reader would care because this removes the need for large annotated sets or expensive proprietary systems and could let existing reports supply labels for training imaging models. The work positions itself as a practical, open alternative for clinical settings that must handle linguistic and labeling variety.

Core claim

MOSAIC is a method built on the MedGemma-4B model that performs radiological report classification without depending on any fixed label taxonomy or single language. It works through zero-shot or few-shot prompting as well as lightweight fine-tuning, reaching a mean macro F1 of 88 across five chest X-ray datasets in English, Spanish, French, and Danish while using only 24 GB of GPU memory. With data augmentation the same approach attains a weighted F1 of 82 on Danish reports using just 80 annotated samples versus 86 with the full training set of 1600 samples.
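The two headline metrics weight classes differently: macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its support. The sketch below illustrates that difference in plain Python. The function names and the toy labels are invented for illustration; how the paper aggregates across findings and datasets is not specified in this summary.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    out = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        out[c] = (f1, tp + fn)  # support = number of true instances of c
    return out


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by class support."""
    scores = per_class_f1(y_true, y_pred)
    total = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total


# Toy, imbalanced example (not data from the paper).
y_true = ["neg"] * 6 + ["pos"] * 2
y_pred = ["neg"] * 6 + ["pos", "neg"]
print(f"macro F1:    {macro_f1(y_true, y_pred):.3f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.3f}")
```

On imbalanced data such as this toy example, the weighted score is pulled toward the majority class, which is one reason the macro and weighted figures quoted above should not be compared directly.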

What carries the argument

MOSAIC, the prompting and fine-tuning framework on the compact MedGemma-4B model that treats label taxonomies as interchangeable rather than fixed.
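One way to picture "taxonomy-agnostic" is a prompt template that receives the label set as a runtime argument instead of hard-coding it. The sketch below is an assumption made for illustration only: the template wording, the `build_prompt` and `parse_output` helpers, the JSON answer format, and the class names are hypothetical and are not the authors' actual prompt or output format.

```python
import json


def build_prompt(report, findings, classes):
    """Assemble a classification prompt for an arbitrary label taxonomy."""
    finding_lines = "\n".join(f"- {name}" for name in findings)
    class_list = ", ".join(classes)
    return (
        "Classify each finding in the radiology report below.\n"
        f"Allowed classes: {class_list}\n"
        f"Findings:\n{finding_lines}\n"
        "Answer with a JSON object mapping each finding to one class.\n\n"
        f"Report:\n{report}"
    )


def parse_output(text, findings, classes):
    """Return {finding: class}, or None when the model output is invalid."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    # Reject outputs that skip a finding, invent one, or use an unknown class.
    if not isinstance(parsed, dict) or set(parsed) != set(findings):
        return None
    if any(value not in classes for value in parsed.values()):
        return None
    return parsed


# Hypothetical taxonomy; swapping these two lists is the only change
# needed to target a different label set.
findings = ["Pneumothorax", "Cardiomegaly"]
classes = ["positive", "negative", "not mentioned"]
print(build_prompt("No pneumothorax. Heart size not assessed.", findings, classes))
```

Validating the output against the same label set that built the prompt also gives a natural way to count invalid outputs per model setting.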

If this is right

Classification becomes feasible with as few as 80 annotated samples for languages such as Danish while preserving high weighted F1 scores.
The system runs on consumer-grade hardware because it requires only 24 GB of GPU memory.
It handles reports in English, Spanish, French, and Danish across multiple imaging modalities and label taxonomies in the tested collections.
Open-source release of code and models allows direct use and extension by clinical teams without proprietary tools.
Performance approaches or exceeds expert-level results on chest X-ray report classification tasks.
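A rough back-of-the-envelope check on the 24 GB figure: the weights of a nominal 4-billion-parameter model take about 8 GB at 16-bit precision, which leaves headroom for activations, the KV cache, and any fine-tuning state. The sketch assumes exactly 4e9 parameters and counts weights only; the paper's actual parameter count, numeric precision, and memory breakdown are not given in this summary.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed for model weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


NOMINAL_PARAMS = 4e9  # assumption: "4B" taken at face value
GPU_BUDGET_GB = 24    # figure reported in the paper

for name, width in [("16-bit", 2), ("32-bit", 4)]:
    used = weight_memory_gb(NOMINAL_PARAMS, width)
    print(f"{name} weights: {used:.0f} GB, headroom {GPU_BUDGET_GB - used:.0f} GB")
```

Under these assumptions the weights fit comfortably at either precision, so the 24 GB budget is plausibly dominated by whatever is needed beyond the weights themselves.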

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Existing radiology reports could supply automatic labels for training medical imaging models at scale without manual annotation campaigns.
The same lightweight adaptation steps might transfer to report classification in other medical domains such as pathology notes or discharge summaries.
Wider testing on entirely new modalities or label structures would show how far the taxonomy-agnostic property extends in practice.
Adoption could lower barriers to using report-derived supervision in settings that currently rely on closed-source large language models.

Load-bearing premise

The prompting and fine-tuning strategies tested on the seven datasets will keep comparable accuracy when applied to radiology reports that use new label taxonomies, different imaging modalities, or unfamiliar clinical writing styles.

What would settle it

Apply MOSAIC without further adaptation to a new radiology report collection that uses a previously unseen label taxonomy or an additional language and measure whether the macro F1 score falls substantially below the reported 88.

Figures

Figures reproduced from arXiv: 2509.04471 by Alice Schiavone, Desmond Elliott, Lea Marie Pehrson, Marco Fraccaro, Melanie Ganz, Michael Bachmann Nielsen, Rasmus Bonnevie, Silvia Ingala, Vincent Beliveau.

**Figure 1.** Performance as (+)F1 score of MedGemma-4B, Llama-8B, and Gemma-12B fine-tuned on MPE+S C, with detailed results on three key chest X-ray pathologies: Cardiomegaly, Pneumothorax, and Pleural Effusion. The data is presented across a range of chest X-ray datasets, illustrating model-specific and dataset-specific performance. On these common findings, all tested models have similar results, also generalizing o…

**Figure 2.** Evaluation of different pretraining and fine…

**Figure 4.** Performance of MOSAIC on the DanskMRI dataset, measured as (+)F1 across three epilepsy-related abnormalities from MRI reports.

**Figure 5.** Distribution of labels in the chest X-ray…

**Figure 6.** Perplexity on the SIB-200 dataset across…

**Figure 7.** Distribution of (+)F1 scores from…

**Figure 8.** Percentage of invalid outputs when testing each model setting in Table…

**Figure 9.** Prediction mismatches on DanskCXR: MedGemma-4B (684 mismatches; 6.5%) vs. Gemma-12B (580 mismatches; 5.5%). […]astinum and normal heart size. (Translated from Danish) Classification comparison, where (-) is the "not mentioned" class:

| Finding | Ground Truth | Gemma-12B | MedGemma-4B |
|---|---|---|---|
| Atelectasis | 2 | 2 | 2 |
| Cardiomegaly | 2 | 2 | 2 |
| Infiltrate | 1 | 1 | 1 |
| LungDecr.Transl. | 1 | 1 | 1 |
| PleuralEffusion | 2 | 2 | 2 |
| Pneumothorax | 2 | 2 | 2 |
| StasisEdema | 2 | 2 | … |

Original abstract

Radiology reports contain rich clinical information that can be used to train imaging models without relying on costly manual annotation. However, existing approaches face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and recent LLM-based systems depend on closed-source or resource-intensive models that are unsuitable for clinical use. Moreover, current solutions are largely restricted to English and single-modality, single-taxonomy datasets. We introduce MOSAIC, a multilingual, taxonomy-agnostic, and computationally efficient approach for radiological report classification. Built on a compact open-access language model (MedGemma-4B), MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. We evaluate MOSAIC across seven datasets in English, Spanish, French, and Danish, spanning multiple imaging modalities and label taxonomies. The model achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance, while requiring only 24 GB of GPU memory. With data augmentation, as few as 80 annotated samples are sufficient to reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings. Code and models are open-source. We invite the community to evaluate and extend MOSAIC on new languages, taxonomies, and modalities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MOSAIC, a multilingual, taxonomy-agnostic approach for radiological report classification built on the compact open-access MedGemma-4B model. It supports zero-/few-shot prompting and lightweight fine-tuning for deployment on consumer GPUs, and reports evaluation across seven datasets in English, Spanish, French, and Danish spanning multiple modalities and label taxonomies. Central results include a mean macro F1 of 88 on five chest X-ray datasets (approaching expert-level performance) with 24 GB GPU memory, plus sample efficiency where data augmentation allows 80 annotated Danish samples to reach weighted F1 of 82 versus 86 with the full 1600-sample set. Code and models are released as open source.

Significance. If the performance numbers hold under scrutiny, the work offers a practical, low-resource alternative to closed-source or large LLMs for clinical report classification. This could meaningfully advance the use of radiology reports to train imaging models without expensive manual annotation, particularly in multilingual and multi-taxonomy settings. Strengths include the explicit open-sourcing of code/models and the focus on consumer-grade hardware (24 GB), which directly addresses deployment barriers in clinical environments.

major comments (3)

[Abstract / Results] Abstract and Results: The headline mean macro F1 of 88 across five chest X-ray datasets is presented as an aggregate without per-dataset macro F1 values, baseline comparisons, or statistical significance tests. This aggregate alone cannot confirm consistent superiority or rule out dataset-specific effects that would weaken the central performance claim.
[Methods] Methods: The taxonomy-agnostic property is asserted via prompting and fine-tuning on MedGemma-4B, but the manuscript provides insufficient detail on how label sets are mapped or adapted across taxonomies. Without this, the claim that the approach generalizes to unseen taxonomies without substantial additional adaptation remains unverified and load-bearing for the broader applicability argument.
[Evaluation / Experiments] Evaluation: The sample-efficiency result (weighted F1 82 with 80 augmented Danish samples) is promising, yet the manuscript lacks an ablation isolating the contribution of data augmentation versus the base prompting/fine-tuning strategy, and does not report variance across multiple runs or seeds. This weakens confidence in the data-efficiency conclusion.

minor comments (2)

[Abstract] The abstract states 'approaching or exceeding expert-level performance' without defining the expert baseline or citing the specific human-performance numbers being compared against.
[Datasets] Dataset statistics (number of reports, label distributions, train/test splits) should be summarized in a table early in the paper to allow readers to contextualize the reported F1 scores.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and describe the specific revisions we will incorporate to strengthen the manuscript.

Point-by-point responses

Referee: [Abstract / Results] Abstract and Results: The headline mean macro F1 of 88 across five chest X-ray datasets is presented as an aggregate without per-dataset macro F1 values, baseline comparisons, or statistical significance tests. This aggregate alone cannot confirm consistent superiority or rule out dataset-specific effects that would weaken the central performance claim.

Authors: We agree that the aggregate mean alone is insufficient to fully substantiate the central claim. In the revised manuscript we will add a table in the Results section that reports macro F1 for each of the five chest X-ray datasets individually, includes the relevant baseline comparisons, and presents statistical significance tests (paired t-tests across datasets where appropriate). This will allow readers to evaluate consistency directly.
Revision: yes

Referee: [Methods] Methods: The taxonomy-agnostic property is asserted via prompting and fine-tuning on MedGemma-4B, but the manuscript provides insufficient detail on how label sets are mapped or adapted across taxonomies. Without this, the claim that the approach generalizes to unseen taxonomies without substantial additional adaptation remains unverified and load-bearing for the broader applicability argument.

Authors: We acknowledge that the current Methods section would benefit from greater explicitness. We will expand the description to include concrete examples of prompt construction for arbitrary label sets, the exact mapping procedure used during inference, and how the fine-tuning objective is formulated to accommodate new taxonomies. These additions will make the generalization mechanism verifiable.
Revision: yes

Referee: [Evaluation / Experiments] Evaluation: The sample-efficiency result (weighted F1 82 with 80 augmented Danish samples) is promising, yet the manuscript lacks an ablation isolating the contribution of data augmentation versus the base prompting/fine-tuning strategy, and does not report variance across multiple runs or seeds. This weakens confidence in the data-efficiency conclusion.

Authors: We agree that an ablation and variance reporting would increase confidence. In the revision we will add an ablation that isolates the effect of data augmentation while holding the prompting and fine-tuning strategy fixed. We will also rerun the Danish sample-efficiency experiments across five random seeds and report means with standard deviations.
Revision: yes
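The statistics promised in these responses (per-seed means with standard deviations, and paired tests across datasets) need only a few lines. The sketch below uses the Python standard library; the score list is a placeholder invented for illustration, not a result from the paper, and `mean_std` and `paired_t_statistic` are hypothetical helper names.

```python
import math
import statistics


def mean_std(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i], with len(a) - 1 degrees of freedom."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder numbers only: five seeds of one hypothetical configuration.
seed_scores = [0.80, 0.82, 0.81, 0.83, 0.84]
m, s = mean_std(seed_scores)
print(f"{m:.3f} +/- {s:.3f}")
```

With five datasets a paired test has only four degrees of freedom, so it can detect only large, consistent differences. Turning the statistic into a p-value requires the t distribution, for example via `scipy.stats.ttest_rel`.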

Circularity Check

0 steps flagged

No circularity: empirical results on held-out test sets

full rationale

The paper introduces MOSAIC as a practical LLM-based classifier and reports mean macro F1 of 88 on five chest X-ray datasets plus a few-shot result on Danish reports. These are direct empirical measurements obtained by running the model on independent, held-out test splits from external datasets. No derivation chain, equations, or first-principles predictions are claimed; performance numbers are not obtained by fitting a parameter inside the same model and then re-using that fit as a 'prediction.' Self-citations, if present, are not load-bearing for the central performance claim. The evaluation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach rests on the pre-trained capabilities of MedGemma-4B and standard prompting/fine-tuning practices rather than new mathematical derivations; no new entities are postulated.

free parameters (1)

Number of few-shot examples or fine-tuning samples
The abstract demonstrates results with as few as 80 samples after augmentation, implying this quantity is chosen or tuned for the reported efficiency claims.

axioms (2)

Domain assumption: MedGemma-4B possesses sufficient multilingual and domain knowledge to classify radiology reports via prompting or light fine-tuning.
The method assumes the base model already encodes the necessary clinical and linguistic patterns without providing new evidence for this capability.

Domain assumption: Performance observed on the seven evaluated datasets and four languages generalizes to new taxonomies and modalities.
The taxonomy-agnostic claim depends on this transfer assumption.

pith-pipeline@v0.9.0 · 5827 in / 1294 out tokens · 52127 ms · 2026-05-21T22:38:05.402860+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning... mean macro F1 score of 88 across five chest X-ray datasets"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Built on a compact open-access language model (MedGemma-4B)... consumer-grade GPUs"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 2 internal anchors

[1] David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O. Alabi, Yanke Mao, Haonan Gao, and Annie En-Shiun Lee. 2024. SIB-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects. Preprint, arXiv:2309.07445. https://arxiv.org/abs/2309.07445

[2] Fares Al Mohamad, Leonhard Donle, Felix Dorfner, Laura Romanescu, Kristin Drechsler, Mike P. Wattjes, Jawed Nawabi, Marcus R. Makowski, Hartmut Häntze, Lisa Adams, Lina Xu, Felix Busch, Aymen Meddeb, and Keno Kyrill Bressem. 2025. Open-source large language models can generate labels from radiology reports for t... https://doi.org/10.1016/j.acra.2024.12.028

[3] Anonymous. 2025. Effective machine learning techniques for non-English radiology report classification: A Danish case study. AI, 6(2). https://doi.org/10.3390/ai6020037

[4] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

[5] Vincent Beliveau, Helene Kaas, Martin Prener, Claes N Ladefoged, Desmond Elliott, Gitte M Knudsen, Lars H Pinborg, and Melanie Ganz. 2024. Classification of radiological text in small and imbalanced datasets in a non-English language. arXiv preprint arXiv:2409.20147.

[6] Lukas Biewald. 2020. Experiment tracking with Weights and Biases. Software available from wandb.com. https://www.wandb.com/

[7] Ricardo Bigolin Lanfredi, Mingyuan Zhang, William F Auffermann, Jessica Chan, Phuong-Anh T Duong, Vivek Srikumar, Trafton Drew, Joyce D Schroeder, and Tolga Tasdizen. 2022. REFLACX, a dataset of reports and eye-tracking data for localization of abnormalities in chest x-rays. Scientific Data, 9(1):350.

[8] Daniel C Castro, Aurelia Bustos, Shruthi Bannur, Stephanie L Hyland, Kenza Bouzid, Maria Teodora Wetscherek, Maria Dolores Sánchez-Valverde, Lara Jaques-Pérez, Lourdes Pérez-Rodríguez, Kenji Takeda, and 1 others. 2024. PadChest-GR: A bilingual chest x-ray dataset for grounded radiology report generation. arXiv preprint arXiv:2411.05085.

[9] Jaime Collado-Montañez, María-Teresa Martín-Valdivia, and Eugenio Martínez-Cámara. 2025. Data augmentation based on large language models for radiological report classification. Knowledge-Based Systems, 308:112745.

[10] Xiang Dai, Ilias Chalkidis, Sune Darkner, and Desmond Elliott. 2022. Revisiting transformer-based models for long document classification. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 7212–7230, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-emnlp.534

[11] Felix J Dorfner, Liv Jürgensen, Leonhard Donle, Fares Al Mohamad, Tobias R Bodenmann, Mason C Cleveland, Felix Busch, Lisa C Adams, James Sato, Thomas Schultz, and 1 others. 2024. Is open-source there yet? A comparative study on commercial and open-source LLMs in their ability to label chest x-ray reports. arXiv preprint arXiv:2402.12298.

[12] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[13] Jawook Gu, Han-Cheol Cho, Jiho Kim, Kihyun You, Eun Kyoung Hong, and Byungseok Roh. 2024. CheX-GPT: Harnessing large language models for enhanced chest x-ray report labeling. arXiv preprint arXiv:2401.11505.

[14] Daniel Han, Michael Han, and Unsloth team. 2023. Unsloth. Software. http://github.com/unslothai/unsloth

[15] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, and 1 others. 2019. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 590–597.

[16] Alistair Johnson, Matt Lungren, Yifan Peng, Zhiyong Lu, Roger Mark, Seth Berkowitz, and Steven Horng. 2019. MIMIC-CXR-JPG - chest radiographs with structured labels. PhysioNet, 101:215–220.

[17] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

[18] Hidetoshi Matsuo, Mizuho Nishio, Takaaki Matsunaga, Koji Fujimoto, and Takamichi Murakami. 2024. Exploring multilingual large language models for enhanced TNM classification of radiology report in lung cancer staging. Cancers, 16(21):3621. https://doi.org/10.3390/cancers16213621

[19] Markus Mergen, Daniel Spitzl, Conrad Ketzer, Maximilian Strenzke, Alexander W. Marka, Marcus R. Makowski, Keno K. Bressem, Lisa C. Adams, and Florian T. Gassert. 2025. Leveraging large language models for accurate AO fracture classification from CT text reports. Journal of Imaging Informatics in Medicine. https://doi.org/10.1007/s10278-025-01603-6

[20] Hichem Metmer and Xiaoshan Yang. 2024. An open chest x-ray dataset with benchmarks for automatic radiology report generation in French. Neurocomputing, 609:128478.

[21] Luc Mottin, Jean-Philippe Goldman, Christoph Jäggli, Rita Achermann, Julien Gobeill, Julien Knafou, Julien Ehrsam, Alexandre Wicky, Camille L Gérard, Tanja Schwenk, and 1 others. 2023. Multilingual RECIST classification of radiology reports using supervised learning. Frontiers in Digital Health, 5:1195017.

[22] Thao Nguyen, Tam M Vo, Thang V Nguyen, Hieu H Pham, and Ha Q Nguyen. 2022. Learning to diagnose common thorax diseases on chest radiographs from radiology reports in Vietnamese. PLOS ONE, 17(10):e0276545.

[23] Matteo Olivato, Luca Putelli, Nicola Arici, Alfonso Emilio Gerevini, Alberto Lavelli, and Ivan Serina. 2024. Language models for hierarchical classification of radiology reports with attention mechanisms, BERT, and GPT-4. IEEE Access, 12:69710–69727. https://doi.org/10.1109/ACCESS.2024.3402066

[24] David M Panicek and Hedvig Hricak. 2016. How sure are you, doctor? A standardized lexicon to describe the radiologist's level of certainty. American Journal of Roentgenology, 207(1):2–3.

[25] Pengcheng Qiu, Chaoyi Wu, Xiaoman Zhang, Weixiong Lin, Haicheng Wang, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2024. Towards building multilingual language model for medicine. Nature Communications, 15(1):8384.

[26] Daniel Reichenpfader, Henning Müller, and Kerstin Denecke. 2024. A scoping review of large language model based approaches for information extraction from radiology reports. npj Digital Medicine, 7(1). https://doi.org/10.1038/s41746-024-01219-0

[27] Eduardo P Reis, Joselisa PQ De Paiva, Maria CB Da Silva, Guilherme AS Ribeiro, Victor F Paiva, Lucas Bulgarelli, Henrique MH Lee, Paulo V Santos, Vanessa M Brito, Lucas TW Amaral, and 1 others. 2022. BRAX, Brazilian labeled chest x-ray dataset. Scientific Data, 9(1):487.

[28] Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis Vlahavas. 2011. On the stratification of multi-label data. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part III 22, pages 145–158. Springer.

[29] Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y. Ng, and Matthew P. Lungren. 2020. CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. CoRR, abs/2004.09167. https://arxiv.org/abs/2004.09167

[30] Kazufumi Suzuki, Hiroki Yamada, Hiroshi Yamazaki, Goro Honda, and Shuji Sakai. 2024. Preliminary assessment of TNM classification performance for pancreatic cancer in Japanese radiology reports using GPT-4. Japanese Journal of Radiology, 43(1):51–55. https://doi.org/10.1007/s11604-024-01643-y

[31] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, and 1 others. 2025. Gemma 3 technical report. arXiv preprint arXiv:2503.19786.

[32] Alessandro Wollek, Sardi Hyska, Thomas Sedlmeyr, Philip Haitzer, Johannes Rueckel, Bastian O. Sabel, Michael Ingrisch, and Tobias Lasser. 2024. German CheXpert chest x-ray radiology report labeler. RöFo - Fortschritte auf dem Gebiet der Röntgenstrahlen und der bildgebenden Verfahren, 196(09):956–965. https://doi.org/10.1055/a-2234-8268

[33] Eric Yang, Matthew D Li, Shruti Raghavan, Francis Deng, Min Lang, Marc D Succi, Ambrose J Huang, and Jayashree Kalpathy-Cramer. 2023. Transformer versus traditional natural language processing: how much data is enough for automated radiology report classification? The British Journal of Radiology, 96(1149):20220769.

[34] Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, and Ping Luo. 2025. GPT4RoI: Instruction tuning large language model on region-of-interest. In European Conference on Computer Vision, pages 52–70. Springer.

[35] Yirong Zhou, Paul K Amundson, Fangsheng Yu, Matthew M Kessler, Tammie L S Benzinger, and Franz J Wippold. 2014. Automated classification of radiology reports to facilitate retrospective study in radiology. Journal of Digital Imaging, 27(6):730–736. https://doi.org/10.1007/s10278-014-9708-x

[1] [1]

Alabi, Yanke Mao, Haonan Gao, and Annie En-Shiun Lee

David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O. Alabi, Yanke Mao, Haonan Gao, and Annie En-Shiun Lee. 2024. https://arxiv.org/abs/2309.07445 Sib-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects . Preprint, arXiv:2309.07445

work page arXiv 2024

[2] [2]

Wattjes, Jawed Nawabi, Marcus R

Fares Al Mohamad , Leonhard Donle, Felix Dorfner, Laura Romanescu, Kristin Drechsler, Mike P. Wattjes, Jawed Nawabi, Marcus R. Makowski, Hartmut Häntze, Lisa Adams, Lina Xu, Felix Busch, Aymen Meddeb, and Keno Kyrill Bressem. 2025. https://doi.org/10.1016/j.acra.2024.12.028 Open-source large language models can generate labels from radiology reports for t...

work page doi:10.1016/j.acra.2024.12.028 2025

[3] [3]

Anonymous. 2025. https://doi.org/10.3390/ai6020037 Effective machine learning techniques for non-english radiology report classification: A danish case study . AI, 6(2)

work page doi:10.3390/ai6020037 2025

[4] [4]

Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65--72

work page 2005

[5] [5]

Vincent Beliveau, Helene Kaas, Martin Prener, Claes N Ladefoged, Desmond Elliott, Gitte M Knudsen, Lars H Pinborg, and Melanie Ganz. 2024. Classification of radiological text in small and imbalanced datasets in a non-english language. arXiv preprint arXiv:2409.20147

work page arXiv 2024

[6] [6]

Lukas Biewald. 2020. https://www.wandb.com/ Experiment tracking with weights and biases . Software available from wandb.com

work page 2020

[7] [7]

Ricardo Bigolin Lanfredi, Mingyuan Zhang, William F Auffermann, Jessica Chan, Phuong-Anh T Duong, Vivek Srikumar, Trafton Drew, Joyce D Schroeder, and Tolga Tasdizen. 2022. Reflacx, a dataset of reports and eye-tracking data for localization of abnormalities in chest x-rays. Scientific data, 9(1):350

work page 2022

[8] [8]

Daniel C Castro, Aurelia Bustos, Shruthi Bannur, Stephanie L Hyland, Kenza Bouzid, Maria Teodora Wetscherek, Maria Dolores S \'a nchez-Valverde, Lara Jaques-P \'e rez, Lourdes P \'e rez-Rodr \' guez, Kenji Takeda, and 1 others. 2024. Padchest-gr: A bilingual chest x-ray dataset for grounded radiology report generation. arXiv preprint arXiv:2411.05085

work page arXiv 2024

[9] [9]

Jaime Collado-Monta \ n ez, Mar \' a-Teresa Mart \' n-Valdivia, and Eugenio Mart \' nez-C \'a mara. 2025. Data augmentation based on large language models for radiological report classification. Knowledge-Based Systems, 308:112745

[10]

Xiang Dai, Ilias Chalkidis, Sune Darkner, and Desmond Elliott. 2022. https://doi.org/10.18653/v1/2022.findings-emnlp.534 Revisiting transformer-based models for long document classification. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 7212--7230, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics

[11]

Felix J Dorfner, Liv Jürgensen, Leonhard Donle, Fares Al Mohamad, Tobias R Bodenmann, Mason C Cleveland, Felix Busch, Lisa C Adams, James Sato, Thomas Schultz, and 1 others. 2024. Is open-source there yet? A comparative study on commercial and open-source LLMs in their ability to label chest x-ray reports. arXiv preprint arXiv:2402.12298

[12]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

[13]

Jawook Gu, Han-Cheol Cho, Jiho Kim, Kihyun You, Eun Kyoung Hong, and Byungseok Roh. 2024. CheX-GPT: Harnessing large language models for enhanced chest x-ray report labeling. arXiv preprint arXiv:2401.11505

[14]

Daniel Han, Michael Han, and Unsloth team. 2023. Unsloth. http://github.com/unslothai/unsloth. Software

[15]

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, and 1 others. 2019. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 590--597

[16]

Alistair Johnson, Matt Lungren, Yifan Peng, Zhiyong Lu, Roger Mark, Seth Berkowitz, and Steven Horng. 2019. MIMIC-CXR-JPG - chest radiographs with structured labels. PhysioNet, 101:215--220

[17]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

[18]

Hidetoshi Matsuo, Mizuho Nishio, Takaaki Matsunaga, Koji Fujimoto, and Takamichi Murakami. 2024. https://doi.org/10.3390/cancers16213621 Exploring multilingual large language models for enhanced TNM classification of radiology report in lung cancer staging. Cancers, 16(21):3621

[19]

Markus Mergen, Daniel Spitzl, Conrad Ketzer, Maximilian Strenzke, Alexander W. Marka, Marcus R. Makowski, Keno K. Bressem, Lisa C. Adams, and Florian T. Gassert. 2025. https://doi.org/10.1007/s10278-025-01603-6 Leveraging large language models for accurate AO fracture classification from CT text reports. Journal of Imaging Informatics in Medicine

[20]

Hichem Metmer and Xiaoshan Yang. 2024. An open chest x-ray dataset with benchmarks for automatic radiology report generation in French. Neurocomputing, 609:128478

[21]

Luc Mottin, Jean-Philippe Goldman, Christoph Jäggli, Rita Achermann, Julien Gobeill, Julien Knafou, Julien Ehrsam, Alexandre Wicky, Camille L Gérard, Tanja Schwenk, and 1 others. 2023. Multilingual RECIST classification of radiology reports using supervised learning. Frontiers in Digital Health, 5:1195017

[22]

Thao Nguyen, Tam M Vo, Thang V Nguyen, Hieu H Pham, and Ha Q Nguyen. 2022. Learning to diagnose common thorax diseases on chest radiographs from radiology reports in Vietnamese. PLOS ONE, 17(10):e0276545

[23]

Matteo Olivato, Luca Putelli, Nicola Arici, Alfonso Emilio Gerevini, Alberto Lavelli, and Ivan Serina. 2024. https://doi.org/10.1109/ACCESS.2024.3402066 Language models for hierarchical classification of radiology reports with attention mechanisms, BERT, and GPT-4. IEEE Access, 12:69710--69727

[24]

David M Panicek and Hedvig Hricak. 2016. How sure are you, doctor? a standardized lexicon to describe the radiologist's level of certainty. American Journal of Roentgenology, 207(1):2--3

[25]

Pengcheng Qiu, Chaoyi Wu, Xiaoman Zhang, Weixiong Lin, Haicheng Wang, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2024. Towards building multilingual language model for medicine. Nature Communications, 15(1):8384

[26]

Daniel Reichenpfader, Henning Müller, and Kerstin Denecke. 2024. https://doi.org/10.1038/s41746-024-01219-0 A scoping review of large language model based approaches for information extraction from radiology reports. npj Digital Medicine, 7(1)

[27]

Eduardo P Reis, Joselisa PQ De Paiva, Maria CB Da Silva, Guilherme AS Ribeiro, Victor F Paiva, Lucas Bulgarelli, Henrique MH Lee, Paulo V Santos, Vanessa M Brito, Lucas TW Amaral, and 1 others. 2022. BRAX, Brazilian labeled chest x-ray dataset. Scientific Data, 9(1):487

[28]

Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis Vlahavas. 2011. On the stratification of multi-label data. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part III 22, pages 145--158. Springer

[29]

Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y. Ng, and Matthew P. Lungren. 2020. https://arxiv.org/abs/2004.09167 CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. CoRR, abs/2004.09167

[30]

Kazufumi Suzuki, Hiroki Yamada, Hiroshi Yamazaki, Goro Honda, and Shuji Sakai. 2024. https://doi.org/10.1007/s11604-024-01643-y Preliminary assessment of TNM classification performance for pancreatic cancer in Japanese radiology reports using GPT-4. Japanese Journal of Radiology, 43(1):51--55

[31]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, and 1 others. 2025. Gemma 3 technical report. arXiv preprint arXiv:2503.19786

[32]

Alessandro Wollek, Sardi Hyska, Thomas Sedlmeyr, Philip Haitzer, Johannes Rueckel, Bastian O. Sabel, Michael Ingrisch, and Tobias Lasser. 2024. https://doi.org/10.1055/a-2234-8268 German CheXpert chest x-ray radiology report labeler. RöFo - Fortschritte auf dem Gebiet der Röntgenstrahlen und der bildgebenden Verfahren, 196(09):956--965

[33]

Eric Yang, Matthew D Li, Shruti Raghavan, Francis Deng, Min Lang, Marc D Succi, Ambrose J Huang, and Jayashree Kalpathy-Cramer. 2023. Transformer versus traditional natural language processing: how much data is enough for automated radiology report classification? The British Journal of Radiology, 96(1149):20220769

[34]

Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, and Ping Luo. 2025. GPT4RoI: Instruction tuning large language model on region-of-interest. In European Conference on Computer Vision, pages 52--70. Springer

[35]

Yirong Zhou, Paul K Amundson, Fangsheng Yu, Matthew M Kessler, Tammie L S Benzinger, and Franz J Wippold. 2014. https://doi.org/10.1007/s10278-014-9708-x Automated classification of radiology reports to facilitate retrospective study in radiology. Journal of Digital Imaging, 27(6):730--736
