MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification
Pith reviewed 2026-05-21 22:38 UTC · model grok-4.3
The pith
MOSAIC classifies radiology reports in multiple languages and taxonomies using a compact open model that matches expert accuracy with low computing needs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MOSAIC is a method built on the MedGemma-4B model that classifies radiological reports without depending on a fixed label taxonomy or a single language. It supports zero-shot and few-shot prompting as well as lightweight fine-tuning. Evaluated on seven datasets in English, Spanish, French, and Danish, it reaches a mean macro F1 of 88 across the five chest X-ray datasets while requiring only 24 GB of GPU memory. With data augmentation, the same approach attains a weighted F1 of 82 on Danish reports using just 80 annotated samples, versus 86 with the full training set of 1600 samples.
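The claim above quotes two different averages, macro F1 and weighted F1, and the difference matters when labels are imbalanced. The following is a minimal sketch of how the two could be computed, assuming a multi-label setting in which each report carries a set of labels. That setting, the function names, and the 0 to 1 scale (the page quotes scores that appear to be on a 0 to 100 scale) are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Dict, Sequence, Set, Tuple


def f1_per_label(
    y_true: Sequence[Set[str]],
    y_pred: Sequence[Set[str]],
    labels: Sequence[str],
) -> Tuple[Dict[str, float], Dict[str, int]]:
    """Return per-label F1 and support (number of true positives in the gold data).

    Hypothetical helper: each element of y_true / y_pred is the set of
    labels assigned to one report.
    """
    scores: Dict[str, float] = {}
    support: Dict[str, int] = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # Convention chosen here: F1 is 0.0 when a label never occurs at all.
        scores[label] = 2 * tp / denom if denom else 0.0
        support[label] = tp + fn
    return scores, support


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean over labels: rare labels count as much as common ones."""
    return sum(scores.values()) / len(scores)


def weighted_f1(scores: Dict[str, float], support: Dict[str, int]) -> float:
    """Mean over labels weighted by support: common labels dominate."""
    total = sum(support.values())
    if total == 0:
        return 0.0
    return sum(scores[label] * support[label] for label in scores) / total
```

Because macro F1 gives every label equal weight, a single poorly handled rare finding lowers it more than it lowers weighted F1, which is one reason the two figures on this page are not directly comparable.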
What carries the argument
MOSAIC, the prompting and fine-tuning framework on the compact MedGemma-4B model that treats label taxonomies as interchangeable rather than fixed.
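One way to picture "interchangeable" taxonomies is a prompt whose label set is an argument rather than a constant. The sketch below is an assumption about how such prompting could look; the prompt wording, the JSON answer format, and the names `build_prompt` and `parse_labels` are hypothetical and are not taken from the MOSAIC code release.

```python
import json
from typing import List, Sequence, Tuple


def build_prompt(
    report: str,
    labels: Sequence[str],
    examples: Sequence[Tuple[str, Sequence[str]]] = (),
) -> str:
    """Build a zero-shot (no examples) or few-shot classification prompt.

    The taxonomy is passed in as `labels`, so switching taxonomies means
    changing an argument, not the code. Wording is illustrative only.
    """
    lines = [
        "Classify the radiology report below.",
        "Answer with a JSON list containing only labels from this set:",
        json.dumps(list(labels), ensure_ascii=False),
    ]
    for ex_report, ex_labels in examples:
        lines += [
            "",
            "Report: " + ex_report,
            "Labels: " + json.dumps(list(ex_labels), ensure_ascii=False),
        ]
    lines += ["", "Report: " + report, "Labels:"]
    return "\n".join(lines)


def parse_labels(model_output: str, labels: Sequence[str]) -> List[str]:
    """Parse the model's answer and drop anything outside the taxonomy."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return []  # unparseable answer: treat as "no labels"
    if not isinstance(parsed, list):
        return []
    allowed = set(labels)
    return [item for item in parsed if isinstance(item, str) and item in allowed]
```

Filtering the parsed answer against the supplied label set keeps out-of-taxonomy strings from leaking into the evaluation, whichever taxonomy is in use.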
If this is right
- Classification becomes feasible with as few as 80 annotated samples for a language such as Danish: with data augmentation, weighted F1 reaches 82, against 86 with the full 1600-sample training set.
- The system runs on consumer-grade hardware because it requires only 24 GB of GPU memory.
- It handles reports in English, Spanish, French, and Danish across multiple imaging modalities and label taxonomies in the tested collections.
- Open-source release of code and models allows direct use and extension by clinical teams without proprietary tools.
- Performance approaches or exceeds expert-level results on chest X-ray report classification tasks.
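The 24 GB figure in the list above can be put in context with a back-of-envelope estimate of weight memory alone. This sketch assumes roughly 4e9 parameters, as the model name suggests, and uses 1 GB = 1e9 bytes. It ignores activations, KV cache, and optimizer state, and the numeric precision MOSAIC actually uses is not stated on this page.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumption: about 4 billion parameters, inferred from the name "MedGemma-4B".
N_PARAMS = 4e9

# Common storage formats and their size per parameter in bytes.
PRECISIONS = {"fp32": 4.0, "bf16": 2.0, "int4": 0.5}

ESTIMATES = {name: weight_memory_gb(N_PARAMS, size) for name, size in PRECISIONS.items()}

if __name__ == "__main__":
    for name, gb in ESTIMATES.items():
        print(f"{name}: about {gb:.0f} GB for weights alone")
```

Under these assumptions the weights occupy well under 24 GB at half precision, which leaves headroom for the inference and fine-tuning overheads that this estimate leaves out.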
Where Pith is reading between the lines
- Existing radiology reports could supply automatic labels for training medical imaging models at scale without manual annotation campaigns.
- The same lightweight adaptation steps might transfer to report classification in other medical domains such as pathology notes or discharge summaries.
- Wider testing on entirely new modalities or label structures would show how far the taxonomy-agnostic property extends in practice.
- Adoption could lower barriers to using report-derived supervision in settings that currently rely on closed-source large language models.
Load-bearing premise
The prompting and fine-tuning strategies tested on the seven datasets will keep comparable accuracy when applied to radiology reports that use new label taxonomies, different imaging modalities, or unfamiliar clinical writing styles.
What would settle it
Apply MOSAIC without further adaptation to a new radiology report collection that uses a previously unseen label taxonomy or an additional language and measure whether the macro F1 score falls substantially below the reported 88.
Original abstract
Radiology reports contain rich clinical information that can be used to train imaging models without relying on costly manual annotation. However, existing approaches face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and recent LLM-based systems depend on closed-source or resource-intensive models that are unsuitable for clinical use. Moreover, current solutions are largely restricted to English and single-modality, single-taxonomy datasets. We introduce MOSAIC, a multilingual, taxonomy-agnostic, and computationally efficient approach for radiological report classification. Built on a compact open-access language model (MedGemma-4B), MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. We evaluate MOSAIC across seven datasets in English, Spanish, French, and Danish, spanning multiple imaging modalities and label taxonomies. The model achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance, while requiring only 24 GB of GPU memory. With data augmentation, as few as 80 annotated samples are sufficient to reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings. Code and models are open-source. We invite the community to evaluate and extend MOSAIC on new languages, taxonomies, and modalities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MOSAIC, a multilingual, taxonomy-agnostic approach for radiological report classification built on the compact open-access MedGemma-4B model. It supports zero-/few-shot prompting and lightweight fine-tuning for deployment on consumer GPUs, and reports evaluation across seven datasets in English, Spanish, French, and Danish spanning multiple modalities and label taxonomies. Central results include a mean macro F1 of 88 on five chest X-ray datasets (approaching expert-level performance) with 24 GB GPU memory, plus sample efficiency where data augmentation allows 80 annotated Danish samples to reach weighted F1 of 82 versus 86 with the full 1600-sample set. Code and models are released as open source.
Significance. If the performance numbers hold under scrutiny, the work offers a practical, low-resource alternative to closed-source or large LLMs for clinical report classification. This could meaningfully advance the use of radiology reports to train imaging models without expensive manual annotation, particularly in multilingual and multi-taxonomy settings. Strengths include the explicit open-sourcing of code/models and the focus on consumer-grade hardware (24 GB), which directly addresses deployment barriers in clinical environments.
major comments (3)
- [Abstract / Results] Abstract and Results: The headline mean macro F1 of 88 across five chest X-ray datasets is presented as an aggregate without per-dataset macro F1 values, baseline comparisons, or statistical significance tests. This aggregate alone cannot confirm consistent superiority or rule out dataset-specific effects that would weaken the central performance claim.
- [Methods] Methods: The taxonomy-agnostic property is asserted via prompting and fine-tuning on MedGemma-4B, but the manuscript provides insufficient detail on how label sets are mapped or adapted across taxonomies. Without this, the claim that the approach generalizes to unseen taxonomies without substantial additional adaptation remains unverified and load-bearing for the broader applicability argument.
- [Evaluation / Experiments] Evaluation: The sample-efficiency result (weighted F1 82 with 80 augmented Danish samples) is promising, yet the manuscript lacks an ablation isolating the contribution of data augmentation versus the base prompting/fine-tuning strategy, and does not report variance across multiple runs or seeds. This weakens confidence in the data-efficiency conclusion.
minor comments (2)
- [Abstract] The abstract states 'approaching or exceeding expert-level performance' without defining the expert baseline or citing the specific human-performance numbers being compared against.
- [Datasets] Dataset statistics (number of reports, label distributions, train/test splits) should be summarized in a table early in the paper to allow readers to contextualize the reported F1 scores.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below and describe the specific revisions we will incorporate to strengthen the manuscript.
- Referee: [Abstract / Results] Abstract and Results: The headline mean macro F1 of 88 across five chest X-ray datasets is presented as an aggregate without per-dataset macro F1 values, baseline comparisons, or statistical significance tests. This aggregate alone cannot confirm consistent superiority or rule out dataset-specific effects that would weaken the central performance claim.
  Authors: We agree that the aggregate mean alone is insufficient to fully substantiate the central claim. In the revised manuscript we will add a table in the Results section that reports macro F1 for each of the five chest X-ray datasets individually, includes the relevant baseline comparisons, and presents statistical significance tests (paired t-tests across datasets where appropriate). This will allow readers to evaluate consistency directly.
  Revision: yes
- Referee: [Methods] Methods: The taxonomy-agnostic property is asserted via prompting and fine-tuning on MedGemma-4B, but the manuscript provides insufficient detail on how label sets are mapped or adapted across taxonomies. Without this, the claim that the approach generalizes to unseen taxonomies without substantial additional adaptation remains unverified and load-bearing for the broader applicability argument.
  Authors: We acknowledge that the current Methods section would benefit from greater explicitness. We will expand the description to include concrete examples of prompt construction for arbitrary label sets, the exact mapping procedure used during inference, and how the fine-tuning objective is formulated to accommodate new taxonomies. These additions will make the generalization mechanism verifiable.
  Revision: yes
- Referee: [Evaluation / Experiments] Evaluation: The sample-efficiency result (weighted F1 82 with 80 augmented Danish samples) is promising, yet the manuscript lacks an ablation isolating the contribution of data augmentation versus the base prompting/fine-tuning strategy, and does not report variance across multiple runs or seeds. This weakens confidence in the data-efficiency conclusion.
  Authors: We agree that an ablation and variance reporting would increase confidence. In the revision we will add an ablation that isolates the effect of data augmentation while holding the prompting and fine-tuning strategy fixed. We will also rerun the Danish sample-efficiency experiments across five random seeds and report means with standard deviations.
  Revision: yes
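The responses above promise two statistics: means with standard deviations across seeds, and paired t-tests across datasets. A minimal sketch of both follows, using only the Python standard library. The function names are hypothetical, and no scores from the paper are used, since the per-dataset and per-seed values are not available on this page.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation (n - 1 denominator), e.g. over seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for two matched score lists.

    a[i] and b[i] must come from the same dataset (or the same seed), for
    example method A and a baseline both evaluated on dataset i.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

With only five datasets the test has four degrees of freedom, so the t statistic has to be compared against a t distribution rather than a normal one, and a non-significant result would not by itself show that two methods perform equally.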
Circularity Check
No circularity: empirical results on held-out test sets
The paper introduces MOSAIC as a practical LLM-based classifier and reports mean macro F1 of 88 on five chest X-ray datasets plus a few-shot result on Danish reports. These are direct empirical measurements obtained by running the model on independent, held-out test splits from external datasets. No derivation chain, equations, or first-principles predictions are claimed; performance numbers are not obtained by fitting a parameter inside the same model and then re-using that fit as a 'prediction.' Self-citations, if present, are not load-bearing for the central performance claim. The evaluation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Number of few-shot examples or fine-tuning samples
axioms (2)
- domain assumption: MedGemma-4B possesses sufficient multilingual and domain knowledge to classify radiology reports via prompting or light fine-tuning
- domain assumption: Performance observed on the seven evaluated datasets and four languages generalizes to new taxonomies and modalities
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction [unclear]
  Relation between the paper passage and the cited Recognition theorem.
  "MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning... mean macro F1 score of 88 across five chest X-ray datasets"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem.
  "Built on a compact open-access language model (MedGemma-4B)... consumer-grade GPUs"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.