RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI
Pith reviewed 2026-05-09 19:18 UTC · model grok-4.3
The pith
Small language models fine-tuned with LoRA achieve strong multi-task radiology performance and run on consumer CPUs without GPUs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LoRA fine-tuning of the Qwen2.5-3B-Instruct and Qwen3-4B models on 162K radiology samples produces accuracy gains of 53 percent on RADS classification, 60 percent on natural language inference, and 89 percent on N-staging relative to zero-shot baselines. The two models display complementary strengths that an oracle ensemble exploits for best results across all tasks. Fine-tuned models can be converted to GGUF format and run at 4-8 tokens per second on consumer CPUs, while few-shot prompting after fine-tuning actually lowers performance.
What carries the argument
LoRA fine-tuning applied to 3-4B parameter language models to adapt them simultaneously to nine radiology tasks, followed by quantization for CPU inference.
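For orientation, the machinery here is ordinary parameter-efficient fine-tuning. A minimal sketch using Hugging Face PEFT and TRL, with the paper's Qwen2.5-3B-Instruct base but hypothetical LoRA hyperparameters and a hypothetical dataset file, since the review specifies neither:

```python
# Minimal LoRA fine-tuning sketch (Hugging Face PEFT + TRL).
# The base model matches the paper; rank, alpha, target modules, and the
# dataset path are illustrative assumptions, not the authors' settings.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Low-rank adapters on the attention projections; well under 1% of the
# model's weights receive gradients.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Hypothetical multi-task JSONL with a single "text" field that mixes all
# nine radiology tasks as instruction-response pairs.
data = load_dataset("json", data_files="radiology_multitask.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    train_dataset=data,
    peft_config=lora,
    args=SFTConfig(output_dir="radlite-lora", num_train_epochs=1,
                   per_device_train_batch_size=4),
)
trainer.train()
trainer.save_model("radlite-lora")  # adapters only; merge before GGUF export
```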
If this is right
- The fine-tuned models can be deployed in clinics that lack GPU hardware or internet access.
- Combining the two models via a per-task oracle ensemble yields the highest scores on every task.
- Parameter-efficient adaptation outperforms in-context few-shot examples for these specialized medical tasks.
- Quantized models occupy roughly 1.8-2.4 GB and deliver 4-8 tokens per second, enough for interactive use on laptops (see the inference sketch after this list).
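The deployment claim is operationally simple to test once a GGUF export exists. A minimal CPU-only inference sketch with llama-cpp-python; the file name, context size, and prompt are illustrative assumptions, with only the ~1.8-2.4 GB size and 4-8 tokens/second envelope taken from the paper:

```python
# CPU-only inference sketch with llama-cpp-python (no GPU required).
# The GGUF file name is hypothetical. A merged LoRA checkpoint is
# typically exported with llama.cpp's convert_hf_to_gguf.py and then
# quantized (e.g., to Q4_K_M) before loading here.
from llama_cpp import Llama

llm = Llama(
    model_path="radlite-qwen2.5-3b-q4_k_m.gguf",
    n_ctx=4096,    # long enough for a full radiology report
    n_threads=8,   # tune to the machine's physical cores
)

report = "FINDINGS: 8 mm spiculated nodule in the right upper lobe..."
out = llm(
    f"Assign the RADS category for this report:\n{report}\nAnswer:",
    max_tokens=64,
    temperature=0.0,  # deterministic decoding for classification tasks
)
print(out["choices"][0]["text"].strip())
```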
Where Pith is reading between the lines
- The same compilation and fine-tuning approach could be applied to other medical report domains such as pathology or cardiology with only modest additional data collection.
- Running inference entirely on local hardware reduces the risk of sending protected patient data to external servers.
- Future experiments could test whether even smaller models under 2B parameters retain acceptable performance after similar training.
Load-bearing premise
The 162K samples drawn from twelve public datasets, together with the held-out test sets of up to 500 samples per task, reflect the variety and difficulty of real clinical radiology reports and images.
What would settle it
Running the same models on a fresh set of 500 real hospital radiology reports and images that were never part of any public dataset and finding accuracy below 60 percent on at least three of the nine tasks would show the results do not generalize.
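Stated as code, that decision rule is mechanical. A sketch with placeholder accuracies (not measurements) standing in for results on such a hypothetical external set:

```python
# Hypothetical falsification check: the results "do not generalize" if
# accuracy falls below 0.60 on at least 3 of the 9 tasks on fresh data.
# All numbers below are placeholders, not reported measurements.
external_accuracy = {
    "rads_classification": 0.71, "impression_generation": 0.58,
    "temporal_comparison": 0.66, "nli": 0.62, "ner": 0.55,
    "abnormality_detection": 0.74, "n_staging": 0.59, "m_staging": 0.68,
    "qa": 0.70,
}

failing = [task for task, acc in external_accuracy.items() if acc < 0.60]
generalizes = len(failing) < 3
print(f"tasks below 0.60: {failing} -> generalizes: {generalizes}")
```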
Original abstract
Large language models (LLMs) show promise in radiology but their deployment is limited by computational requirements that preclude use in resource-constrained clinical environments. We investigate whether small language models (SLMs) of 3-4 billion parameters can achieve strong multi-task radiology performance through LoRA fine-tuning, enabling deployment on consumer-grade CPUs. We train Qwen2.5-3B-Instruct and Qwen3-4B on 162K samples spanning 9 radiology tasks - RADS classification across 10 systems, impression generation, temporal comparison, radiology NLI, NER, abnormality detection, N/M staging, and radiology Q&A - compiled from 12 public datasets. Both models are evaluated on up to 500 held-out test samples per task with standardized metrics. Our key findings are: (1) LoRA fine-tuning dramatically improves performance over zero-shot baselines (RADS accuracy +53%, NLI +60%, N-staging +89%); (2) the two models exhibit complementary strengths - Qwen2.5 excels at structured generation tasks while Qwen3 dominates extractive tasks; (3) a task-outed oracle ensemble combining both models achieves the best performance across all tasks; (4) few-shot prompting with fine-tuned models hurts performance, demonstrating that LoRA adaptation is more effective than in-context learning for specialized domains; and (5) models can be quantized to GGUF format (~1.8-2.4GB) for CPU deployment at 4-8 tokens/second on consumer hardware. Our work demonstrates that small, efficiently fine-tuned models - which we collectively call RadLite - can serve as practical multi-task radiology AI assistants deployable entirely on consumer hardware without GPU requirements. Code and models are available at https://github.com/RadioX-Labs/RadLite
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RadLite, a collection of LoRA fine-tuned 3-4B parameter models (Qwen2.5-3B-Instruct and Qwen3-4B) trained on 162K samples compiled from 12 public radiology datasets spanning 9 tasks (RADS classification, impression generation, temporal comparison, NLI, NER, abnormality detection, N/M staging, and Q&A). It reports large gains over zero-shot baselines (e.g., +53% RADS accuracy, +60% NLI, +89% N-staging), complementary strengths between the two models, superiority of LoRA adaptation over few-shot prompting, an oracle ensemble that combines them, and successful GGUF quantization enabling 4-8 tokens/second inference on consumer CPUs without GPUs.
Significance. If the quantitative results and deployment claims hold after detailed scrutiny, the work would be significant for demonstrating that small, efficiently adapted models can handle diverse radiology tasks at practical speeds on consumer hardware. This could lower barriers to AI assistance in resource-constrained clinical settings. The open release of code and models, the multi-task compilation, and the observation of task complementarity are positive contributions that support reproducibility and further research in domain-specific SLM adaptation.
major comments (2)
- Abstract and §4 (Results): The large reported gains (+53% RADS accuracy, +60% NLI, +89% N-staging) are stated without specifying the exact evaluation metrics (accuracy vs. F1 vs. other), the precise zero-shot baseline configurations, per-task test sample counts and splits, or any statistical significance testing or confidence intervals. These details are load-bearing for interpreting whether the improvements are robust or potentially inflated by evaluation choices.
- §4 (Evaluation) and §5 (Discussion): All performance numbers are measured on held-out samples drawn from the same 12 public datasets used for training. No external clinical validation set, multi-institutional test data, or out-of-distribution evaluation is described. This directly weakens the central claim that the models constitute 'practical' multi-task radiology AI assistants for real-world clinical deployment, where reporting styles, dictation errors, and institutional variability differ from curated public corpora.
minor comments (2)
- Abstract: The phrase 'task-outed oracle ensemble' is nonstandard and undefined; it should be clarified (most plausibly a typo for 'task-routed', i.e., an oracle that picks the stronger model per task) together with a brief description of how the oracle is constructed; a minimal routing sketch follows this list.
- §3 (Methods): The distribution of the 162K training samples across the 9 tasks and 12 datasets is not summarized (e.g., via a table of sample counts per task). This makes it difficult to assess task balance and potential dominance by larger datasets.
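For concreteness, the most plausible reading is a per-task router fixed by held-out scores. A minimal sketch; the routing table is an assumption inferred from the abstract's characterization (Qwen2.5 for structured generation, Qwen3 for extractive tasks), not the authors' published assignment:

```python
# Hypothetical per-task oracle ensemble: route each request to whichever
# fine-tuned model scored higher on that task's held-out split. The table
# is inferred from the abstract, not taken from the paper.
from typing import Callable, Dict

ORACLE_ROUTES: Dict[str, str] = {
    "rads_classification": "qwen2.5-3b",   # structured generation
    "impression_generation": "qwen2.5-3b",
    "temporal_comparison": "qwen2.5-3b",
    "nli": "qwen3-4b",                     # extractive tasks
    "ner": "qwen3-4b",
    "abnormality_detection": "qwen3-4b",
    "n_staging": "qwen3-4b",
    "m_staging": "qwen3-4b",
    "qa": "qwen3-4b",
}

def oracle_ensemble(task: str, prompt: str,
                    models: Dict[str, Callable[[str], str]]) -> str:
    """Dispatch to the per-task winner. 'Oracle' because the winner is
    chosen with test-set knowledge, so this is an upper bound on a
    learned router, not a deployable one."""
    return models[ORACLE_ROUTES[task]](prompt)
```

Because the winner is chosen with knowledge of test performance, the ensemble bounds what a learned router could achieve; the revision should state this explicitly.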
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which has helped us improve the clarity and transparency of our work. We address each major comment below and have revised the manuscript accordingly.
Point-by-point responses
Referee: Abstract and §4 (Results): The large reported gains (+53% RADS accuracy, +60% NLI, +89% N-staging) are stated without specifying the exact evaluation metrics (accuracy vs. F1 vs. other), the precise zero-shot baseline configurations, per-task test sample counts and splits, or any statistical significance testing or confidence intervals. These details are load-bearing for interpreting whether the improvements are robust or potentially inflated by evaluation choices.
Authors: We agree that additional methodological detail is necessary for proper interpretation. In the revised manuscript, we have expanded the abstract and §4 to specify the exact metrics for each task (accuracy for RADS classification, N/M staging, and NLI; F1-score for NER and abnormality detection; ROUGE-L and BLEU for impression generation and temporal comparison). The zero-shot baselines are now explicitly defined as the unmodified base Qwen2.5-3B-Instruct and Qwen3-4B models with direct prompting and no in-context examples. We report the precise held-out test sizes (200–500 samples per task) and the dataset-specific train/test splits. We have also added 95% bootstrap confidence intervals for all reported metrics and noted statistically significant improvements (p < 0.05 via paired t-tests). These changes directly address the concern about potential inflation due to evaluation choices. revision: yes
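The promised intervals are straightforward to reproduce. A minimal percentile-bootstrap sketch over per-sample correctness, assuming accuracy-style metrics; it illustrates the procedure, not the authors' exact code:

```python
# Percentile-bootstrap 95% CI for a per-sample metric such as accuracy.
# Illustrative only; the manuscript's exact resampling setup is not given.
import numpy as np

def bootstrap_ci(per_sample_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Resample the test set with replacement; return (lo, hi) bounds on
    the mean score."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_sample_scores, dtype=float)
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    means = scores[idx].mean(axis=1)
    return (np.quantile(means, alpha / 2),
            np.quantile(means, 1 - alpha / 2))

# Example: 500 held-out predictions scored 1 (correct) or 0 (wrong).
correct = np.array([1, 0, 1, 1, 1] * 100)
lo, hi = bootstrap_ci(correct)
print(f"accuracy = {correct.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```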
Referee: §4 (Evaluation) and §5 (Discussion): All performance numbers are measured on held-out samples drawn from the same 12 public datasets used for training. No external clinical validation set, multi-institutional test data, or out-of-distribution evaluation is described. This directly weakens the central claim that the models constitute 'practical' multi-task radiology AI assistants for real-world clinical deployment, where reporting styles, dictation errors, and institutional variability differ from curated public corpora.
Authors: The referee is correct that our evaluations are confined to held-out splits from the same public corpora. We have revised §5 to include an expanded limitations paragraph that explicitly discusses this constraint, the risks of domain shift from institutional reporting differences and dictation noise, and the need for prospective clinical validation. We have also softened language around immediate 'practical' deployment to emphasize the work as a proof-of-concept for CPU-deployable multi-task radiology SLMs. However, because the study relies exclusively on publicly released datasets, we do not have access to external multi-institutional or prospective clinical data and therefore cannot supply such validation results. revision: partial
- Unresolved point: We do not possess external clinical or multi-institutional validation data and cannot perform the requested out-of-distribution evaluation.
Circularity Check
No circularity; purely empirical evaluation on held-out data
full rationale
The paper compiles 162K samples from 12 public datasets, applies LoRA fine-tuning to 3-4B models, and reports performance metrics on up to 500 held-out test samples per task against zero-shot baselines. No equations, parameter fits presented as predictions, self-citations, uniqueness theorems, or ansatzes appear in the provided text. All claims (accuracy gains, model complementarity, CPU deployment speeds) are direct experimental measurements on the chosen splits, not reductions by construction. This is standard supervised ML evaluation and remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LoRA fine-tuning can substantially adapt small language models to specialized medical domains without catastrophic forgetting.
Reference graph
Works this paper leans on
- [1] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models. International Conference on Learning Representations (ICLR).
- [2] Yang, A., Yang, B., et al. (2025). Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115.
- [3] Yang, A., et al. (2025). Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
- [4] Singhal, K., Azizi, S., Tu, T., et al. (2023). Large Language Models Encode Clinical Knowledge. Nature, 620, 172–180.
- [5] Meskó, B., & Topol, E. J. (2023). The Imperative for Regulatory Oversight of Large Language Models (or Generative AI) in Healthcare. npj Digital Medicine, 6, 120.
- [6]
- [7] Delbrouck, J.-B., Chambon, P., Chen, Z., Varma, M., Johnston, A., Blankemeier, L., Van Veen, D., et al. (2024). RadGraph-XL: A Large-Scale Expert-Annotated Dataset for Entity and Relation Extraction from Radiology Reports. Findings of the Association for Computational Linguistics (ACL 2024), 12902–12915.
- [8]
- [9] Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A. Y., & Lungren, M. P. (2020). CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:2004.09167.
- [10] Miura, Y., et al. (2021). Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
- [11] Johnson, A. E. W., et al. (2019). MIMIC-CXR, a De-identified Publicly Available Database of Chest Radiographs with Free-Text Reports. Scientific Data, 6, 317.
- [12] Hamamci, I. E., et al. (2025). Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography. Nature Biomedical Engineering. DOI: 10.1038/s41551-025-01599-y. arXiv:2403.17834.
- [13]
- [14]
- [15] Jin, D., et al. (2021). What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11(14), 6421.
- [16] Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. Advances in Neural Information Processing Systems (NeurIPS).
- [17]
- [18]
- [19] Abdin, M., et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint arXiv:2404.14219.
- [20] Gemma Team (2024). Gemma 2: Improving Open Language Models at a Practical Size. arXiv preprint arXiv:2408.00118.
- [21] Huang, C., et al. (2024). LoRAHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition. Proceedings of COLM.
- [22] Frantar, E., et al. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. International Conference on Learning Representations (ICLR).
- [23] Lin, J., et al. (2024). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv preprint arXiv:2306.00978.
- [24] GGML Contributors (2023). GGUF: GGML Universal File Format for Large Language Models. https://github.com/ggerganov/ggml.
- [25] Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out, 74–81.
- [26] Zhao, Z., Wallace, E., Feng, S., Klein, D., & Singh, S. (2021). Calibrate Before Use: Improving Few-Shot Performance of Language Models. Proceedings of the International Conference on Machine Learning (ICML), 12697–12706.
- [27] Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., & Van Durme, B. (2018). Hypothesis Only Baselines in Natural Language Inference. Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics (*SEM 2018), 180–191.
- [28] D'Orsi, C. J., Sickles, E. A., Mendelson, E. B., & Morris, E. A. (Eds.). (2013). ACR BI-RADS® Atlas, Breast Imaging Reporting and Data System (5th ed.). American College of Radiology.
- [29] Bannur, S., et al. (2023). Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15016–15027.
- [30] Zhang, Y., Ding, D., Qian, T., Manning, C. D., & Langlotz, C. P. (2018). Learning to Summarize Radiology Findings. Proceedings of the LOUHI Workshop at EMNLP, 204–213.
- [31] Jain, S., Agrawal, A., Saporta, A., Truong, S. Q. H., Duong, D. N., Bui, T., Chambon, P., Zhang, Y., Lungren, M. P., Ng, A. Y., Langlotz, C. P., & Rajpurkar, P. (2021). RadGraph: Extracting Clinical Entities and Relations from Radiology Reports. Proceedings of the Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track.
- [32] Han, T., Adams, L. C., et al. (2023). MedAlpaca: An Open-Source Collection of Medical Conversational AI Models and Training Data. arXiv preprint arXiv:2304.08247.
- [33] Thirunavukarasu, A. J., Ting, D. S. J., et al. (2023). Large Language Models in Medicine. Nature Medicine, 29, 1930–1940.
- [34] Blankemeier, L., Kumar, A., Cohen, J. P., Liu, J., Liu, L., Van Veen, D., Gardezi, S. J., Yu, H., Paschali, M., Chen, Z., & Delbrouck, J.-B. (2026). Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset. Nature.