pith. machine review for the scientific record.

arxiv: 2604.19937 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Recognition: unknown

Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning


Pith reviewed 2026-05-10 03:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords wound infection classification · vision-language model · clinical reasoning · reasoning distillation · reinforcement learning · chronic wound images · interpretability · medical image analysis

The pith

A 4B-parameter vision-language model classifies chronic wound infections from photos at 86.8% accuracy and generates rationales that experts rate correct or partially correct 94% of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Infection-Reasoner, a compact vision-language model designed to classify infections in chronic wound photographs while producing evidence-grounded clinical explanations. It tackles data scarcity through a two-stage process that first distills chain-of-thought rationales from GPT-5.1 on unlabeled images and then refines the model with reinforcement learning on a small labeled infection dataset. The resulting model is evaluated on a held-out heterogeneous collection of wound images varying in cause, location, and capture conditions, where it reports higher accuracy, sensitivity, and specificity than several baselines, including its larger GPT-5.1 teacher. Separate assessments by multiple multimodal judges and wound experts indicate that the generated rationales are largely supported by visible image features.

Core claim

Infection-Reasoner achieves 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity on a held-out heterogeneous wound dataset, outperforming GPT-5.1 and other baselines, while producing rationales that receive visual-support agreement scores of 0.722–0.903 from MLLM judges and are rated Correct by experts in 61.8% of cases and Partially Correct in 32.4% of cases.
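The three headline numbers are standard binary-classification metrics over a confusion matrix. A minimal sketch of their definitions, using invented counts (the paper's actual confusion matrix is not reproduced in this summary):

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall on infected), and specificity
    from binary infected / not-infected confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # infected cases correctly flagged
    specificity = tn / (tn + fp)   # non-infected cases correctly cleared
    return accuracy, sensitivity, specificity

# Illustrative counts only: 200 images, 100 of them infected.
acc, sens, spec = binary_metrics(tp=86, fp=13, tn=87, fn=14)
# acc = 0.865, sens = 0.86, spec = 0.87
```

Because the three metrics move independently, reporting all of them (as the paper does) constrains the error profile far more tightly than accuracy alone.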

What carries the argument

The two-stage training pipeline that first performs reasoning distillation of GPT-5.1 chain-of-thought rationales onto the Qwen3-VL-4B-Thinking student model and then applies Group Relative Policy Optimization reinforcement learning on a small labeled infection dataset.
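The second stage's core mechanism, Group Relative Policy Optimization, replaces a learned value network with a group-relative baseline: for each image the policy samples several rationales, each receives a scalar reward, and each sample's advantage is its reward standardized within its own group. A minimal sketch of that advantage computation; the exact-match reward shown is our assumption for illustration, as the paper's reward design is not detailed in this summary:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each sampled rollout's
    reward against the mean/std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rationales sampled for one wound image; reward 1.0 when the
# final infected/not-infected answer matches the label (illustrative).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Correct rationales get positive advantage, incorrect ones negative,
# with no value network required.
```

The `eps` term keeps the computation defined when every rollout in a group earns the same reward, in which case all advantages are zero and the group contributes no gradient.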

If this is right

  • The model supplies both a classification decision and an explicit visual reasoning trace suitable for point-of-care review.
  • Performance exceeds that of the larger GPT-5.1 model despite using far fewer parameters.
  • Rationale quality remains high across heterogeneous wound etiologies, locations, and imaging conditions.
  • The pipeline reduces reliance on large volumes of expert-annotated reasoning data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation-plus-RL recipe could be tested on other medical image classification tasks that currently lack reasoning annotations.
  • Deployment on mobile devices becomes feasible because the final model is only 4B parameters.
  • If rationale quality holds in prospective clinical use, the outputs could serve as training material for human clinicians.

Load-bearing premise

GPT-5.1-generated rationales on unlabeled wound images supply accurate and unbiased supervision that the small labeled dataset can then refine without inheriting teacher errors or causing overfitting.

What would settle it

Retraining the same 4B base model on the identical small labeled set but without the GPT-5.1 distillation stage and measuring whether accuracy falls below 86.8% or expert-rated rationale correctness drops below 60%.
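One concrete way to score that ablation (our suggestion; the summary does not say which significance test, if any, the paper uses) is McNemar's exact test on the paired predictions of the distilled model and the no-distillation retrain over the same held-out images. Only the discordant pairs carry information:

```python
import math

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant counts:
    b = images only model A gets right, c = images only model B
    gets right; concordant pairs are uninformative."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Lower tail of Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts on a shared held-out set.
p = mcnemar_exact(b=30, c=12)  # small p: the accuracy gap is unlikely by chance
```

A paired test is appropriate here precisely because both models are evaluated on the identical held-out images, so per-image agreement structure can be exploited rather than averaged away.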

Figures

Figures reproduced from arXiv: 2604.19937 by Bengisu Tulu, Deepak Kumar, Diane Strong, Emmanuel Agu, Palawat Busaranuvong, Reza Saadati Fard, Shefalika Gautam.

Figure 1. The Infection-Reasoner two-stage training pipeline.
Figure 2. Reasoning prompt used for teacher and student CoT generation in Stage 1 and for student RL post-training in Stage 2.
Figure 3. Example wound image augmentations used to increase appearance diversity while preserving clinical content.
Figure 4. Example Stage 1 pair: wound image (left) and teacher-generated rationale with final answer (right).
Figure 5. Examples of labeled wound images used for RL post-training.
Figure 6. Qualitative rationale comparison across models on the same wound image.
Figure 7. Accuracy analysis across wound types: comparison of classification accuracy between Infection-Reasoner and GPT-5.1.
Figure 8. Distribution of per-image rationale-grounding agreement scores across four MLLM judges over five runs.
Figure 9. Human expert evaluation of Infection-Reasoner rationales.
Figure 10. Expert-aligned correct-case example.
Figure 11. Example of a partially correct reasoning case in which Infection-Reasoner correctly predicts the final label.
Figure 12. Failure-mode example showing a completely incorrect Infection-Reasoner prediction and rationale.
Original abstract

Assessing chronic wound infection from photographs is challenging because visual appearance varies across wound etiologies, anatomical locations, and imaging conditions. Prior image-based deep learning methods have mainly focused on classification with limited interpretability, despite the need for evidence-grounded explanations to support point-of-care decision making. We present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model for chronic wound infection classification and rationale generation. To address the scarcity of expert-labeled wound images with reasoning annotations, Infection-Reasoner is trained using a two-stage pipeline: (1) reasoning distillation, in which GPT-5.1 generates chain-of-thought rationales for unlabeled wound images to initialize wound-specific reasoning in a smaller student model (Qwen3-VL-4B-Thinking), and (2) reinforcement learning post-training with Group Relative Policy Optimization on a small labeled infection dataset to refine classification reasoning. On a held-out heterogeneous wound dataset, Infection-Reasoner achieved 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Rationale quality was further evaluated using both multimodal large language model (MLLM) judges and wound expert review. Across four MLLM judges, visual-support agreement scores ranged from 0.722 to 0.903, while expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Infection-Reasoner, a 4B-parameter vision-language model for chronic wound infection classification from photographs together with evidence-grounded rationale generation. Training proceeds in two stages: (1) reasoning distillation in which GPT-5.1 produces chain-of-thought rationales on unlabeled wound images to initialize a Qwen3-VL-4B-Thinking student, and (2) Group Relative Policy Optimization (GRPO) reinforcement learning on a small labeled infection dataset. On a held-out heterogeneous wound dataset the model reports 86.8% accuracy, 86.4% sensitivity and 87.1% specificity, outperforming several baselines including GPT-5.1. Rationale quality is assessed by four MLLM judges (visual-support agreement 0.722–0.903) and by wound-expert review (61.8% rated Correct, 32.4% Partially Correct).

Significance. If the empirical claims hold after the requested clarifications, the work supplies a compact, interpretable model that directly addresses the interpretability gap in prior image-only deep-learning wound classifiers. The distillation-plus-RL recipe for injecting clinical reasoning under limited labeled data is technically interesting and the reported outperformance of GPT-5.1 is a notable result. The combination of MLLM and expert rationale evaluation is a positive step toward grounded assessment.

major comments (2)
  1. [Abstract] Abstract: the reported performance figures (86.8% accuracy, 86.4% sensitivity, 87.1% specificity) and the claim of outperforming GPT-5.1 are presented without dataset size, diversity statistics, statistical significance tests, exact baseline implementations, or ablation results. These omissions are load-bearing for any claim that the two-stage pipeline yields reliable clinical reasoning.
  2. [Abstract] Abstract (two-stage pipeline description): no quantitative validation (expert agreement, error rate, or inter-rater reliability) is supplied for the GPT-5.1-generated chain-of-thought rationales on the unlabeled distillation images. Because the final expert review (61.8% Correct) occurs only after GRPO and does not isolate the distillation contribution, it is impossible to determine whether the reported rationale quality reflects genuine clinical reasoning or inherited teacher artifacts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment below and have revised the manuscript to improve clarity and transparency regarding our experimental details and limitations.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the reported performance figures (86.8% accuracy, 86.4% sensitivity, 87.1% specificity) and the claim of outperforming GPT-5.1 are presented without dataset size, diversity statistics, statistical significance tests, exact baseline implementations, or ablation results. These omissions are load-bearing for any claim that the two-stage pipeline yields reliable clinical reasoning.

    Authors: We agree that the abstract, as a concise summary, omits supporting details that are important for evaluating the claims. The full manuscript provides the held-out dataset description (size, diversity across etiologies, anatomical locations, and imaging conditions) in the experimental setup, reports statistical significance tests for outperformance over baselines including GPT-5.1 in the results, details exact baseline implementations and hyperparameters in the methods, and presents ablation studies on the two-stage pipeline. To address the concern, we have revised the abstract to include a brief reference to dataset scale and statistical significance while maintaining length constraints. revision: partial

  2. Referee: [Abstract] Abstract (two-stage pipeline description): no quantitative validation (expert agreement, error rate, or inter-rater reliability) is supplied for the GPT-5.1-generated chain-of-thought rationales on the unlabeled distillation images. Because the final expert review (61.8% Correct) occurs only after GRPO and does not isolate the distillation contribution, it is impossible to determine whether the reported rationale quality reflects genuine clinical reasoning or inherited teacher artifacts.

    Authors: We acknowledge the value of isolating the distillation stage's contribution through separate quantitative validation of the GPT-5.1 rationales. This was not performed due to the substantial expert annotation costs involved. The expert review was conducted on the final post-GRPO outputs to evaluate the end-to-end system, while MLLM judge scores offer additional supporting evidence for rationale quality. In the revised manuscript, we have added a dedicated limitations paragraph in the discussion that explicitly notes this gap, explains how GRPO refines initial reasoning, and outlines plans for future work on granular distillation ablations. revision: yes

Circularity Check

0 steps flagged

No circularity: results from empirical training on external data with held-out evaluation

Full rationale

The paper presents a standard two-stage ML pipeline (reasoning distillation from GPT-5.1 on unlabeled wound images, followed by GRPO RL refinement on a small labeled set) and reports accuracy/sensitivity/specificity on a held-out heterogeneous dataset plus independent rationale quality checks by MLLM judges and wound experts. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. All performance numbers derive from external data splits and external judges rather than reducing to the training inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance and rationale-quality claims rest on the quality of synthetic reasoning data from GPT-5.1 and the representativeness of the small labeled wound dataset; neither is independently verified beyond the reported aggregate metrics.

free parameters (1)
  • GRPO hyperparameters
    Group Relative Policy Optimization settings are chosen during post-training but not quantified in the abstract.
axioms (1)
  • domain assumption: GPT-5.1 chain-of-thought rationales on unlabeled wound images constitute high-quality supervision for clinical reasoning
    This premise underpins the entire first-stage distillation.

pith-pipeline@v0.9.0 · 5596 in / 1317 out tokens · 60817 ms · 2026-05-10T03:11:11.828887+00:00 · methodology


    Model generation text: Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning• 29 {GENERATION_TEXT} Return STRICT JSON with this exact schema: { "parsed_think": "string", "rubric": { "purulence_pus": { "TEXT_CLAIM":"POS|NEG|UNC|NOT_MENTIONED", "text_span":"string", "IMAGE_EVIDENCE":...