pith. machine review for the scientific record.

arxiv: 2604.19937 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Recognition: unknown

Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning


Pith reviewed 2026-05-10 03:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords wound infection classification · vision-language model · clinical reasoning · reasoning distillation · reinforcement learning · chronic wound images · interpretability · medical image analysis

The pith

A 4B-parameter vision-language model classifies chronic wound infections from photos at 86.8% accuracy and generates rationales that experts rate correct or partially correct 94% of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Infection-Reasoner, a compact vision-language model designed to classify infections in chronic wound photographs while producing evidence-grounded clinical explanations. It tackles data scarcity through a two-stage process that first distills chain-of-thought rationales from GPT-5.1 on unlabeled images and then refines the model with reinforcement learning on a small labeled infection dataset. The resulting model is evaluated on a held-out heterogeneous collection of wound images varying in cause, location, and capture conditions, where it reports higher accuracy, sensitivity, and specificity than several baselines, including its larger GPT-5.1 teacher. Separate assessments by multiple multimodal judges and wound experts indicate that the generated rationales are largely supported by visible image features.

Core claim

Infection-Reasoner achieves 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity on a held-out heterogeneous wound dataset, outperforming GPT-5.1 and other baselines, while producing rationales that receive visual-support agreement scores of 0.722–0.903 from MLLM judges and are rated Correct by experts in 61.8% of cases and Partially Correct in 32.4% of cases.
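The three headline numbers are standard binary-classification metrics over a confusion matrix. A minimal sketch of their definitions, using invented counts (the paper's actual confusion matrix is not reproduced in this summary):

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall on infected), and specificity
    from binary infected / not-infected confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # infected cases correctly flagged
    specificity = tn / (tn + fp)   # non-infected cases correctly cleared
    return accuracy, sensitivity, specificity

# Illustrative counts only: 200 images, 100 of them infected.
acc, sens, spec = binary_metrics(tp=86, fp=13, tn=87, fn=14)
# acc = 0.865, sens = 0.86, spec = 0.87
```

Because the three metrics move independently, reporting all of them (as the paper does) constrains the error profile far more tightly than accuracy alone.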

What carries the argument

The two-stage training pipeline that first performs reasoning distillation of GPT-5.1 chain-of-thought rationales onto the Qwen3-VL-4B-Thinking student model and then applies Group Relative Policy Optimization reinforcement learning on a small labeled infection dataset.
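The second stage's core mechanism, Group Relative Policy Optimization, replaces a learned value network with a group-relative baseline: for each image the policy samples several rationales, each receives a scalar reward, and each sample's advantage is its reward standardized within its own group. A minimal sketch of that advantage computation; the exact-match reward shown is our assumption for illustration, as the paper's reward design is not detailed in this summary:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each sampled rollout's
    reward against the mean/std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rationales sampled for one wound image; reward 1.0 when the
# final infected/not-infected answer matches the label (illustrative).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Correct rationales get positive advantage, incorrect ones negative,
# with no value network required.
```

The `eps` term keeps the computation defined when every rollout in a group earns the same reward, in which case all advantages are zero and the group contributes no gradient.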

If this is right

  • The model supplies both a classification decision and an explicit visual reasoning trace suitable for point-of-care review.
  • Performance exceeds that of the larger GPT-5.1 model despite using far fewer parameters.
  • Rationale quality remains high across heterogeneous wound etiologies, locations, and imaging conditions.
  • The pipeline reduces reliance on large volumes of expert-annotated reasoning data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation-plus-RL recipe could be tested on other medical image classification tasks that currently lack reasoning annotations.
  • Deployment on mobile devices becomes feasible because the final model is only 4B parameters.
  • If rationale quality holds in prospective clinical use, the outputs could serve as training material for human clinicians.

Load-bearing premise

GPT-5.1-generated rationales on unlabeled wound images supply accurate and unbiased supervision that the small labeled dataset can then refine without inheriting teacher errors or causing overfitting.

What would settle it

Retraining the same 4B base model on the identical small labeled set but without the GPT-5.1 distillation stage and measuring whether accuracy falls below 86.8% or expert-rated rationale correctness drops below 60%.
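One concrete way to score that ablation (our suggestion; the summary does not say which significance test, if any, the paper uses) is McNemar's exact test on the paired predictions of the distilled model and the no-distillation retrain over the same held-out images. Only the discordant pairs carry information:

```python
import math

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant counts:
    b = images only model A gets right, c = images only model B
    gets right; concordant pairs are uninformative."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Lower tail of Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts on a shared held-out set.
p = mcnemar_exact(b=30, c=12)  # small p: the accuracy gap is unlikely by chance
```

A paired test is appropriate here precisely because both models are evaluated on the identical held-out images, so per-image agreement structure can be exploited rather than averaged away.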

Figures

Figures reproduced from arXiv: 2604.19937 by Bengisu Tulu, Deepak Kumar, Diane Strong, Emmanuel Agu, Palawat Busaranuvong, Reza Saadati Fard, Shefalika Gautam.

Figure 1. The Infection-Reasoner two-stage training pipeline.
Figure 2. Reasoning prompt used for teacher and student CoT generation in Stage 1 and for student RL post-training in Stage 2.
Figure 3. Example wound image augmentations used to increase appearance diversity while preserving clinical content.
Figure 4. Example Stage 1 pair: wound image (left) and teacher-generated rationale with final answer (right).
Figure 5. Examples of labeled wound images used for RL post-training.
Figure 6. Qualitative rationale comparison across models on the same wound image.
Figure 7. Accuracy analysis across wound types: comparison of classification accuracy between Infection-Reasoner and GPT-5.1.
Figure 8. Distribution of per-image rationale-grounding agreement scores across four MLLM judges over five runs.
Figure 9. Human expert evaluation of Infection-Reasoner rationales.
Figure 10. Expert-aligned correct-case example.
Figure 11. Example of a partially correct reasoning case in which Infection-Reasoner correctly predicts the final label.
Figure 12. Failure-mode example showing a completely incorrect Infection-Reasoner prediction and rationale.
Original abstract

Assessing chronic wound infection from photographs is challenging because visual appearance varies across wound etiologies, anatomical locations, and imaging conditions. Prior image-based deep learning methods have mainly focused on classification with limited interpretability, despite the need for evidence-grounded explanations to support point-of-care decision making. We present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model for chronic wound infection classification and rationale generation. To address the scarcity of expert-labeled wound images with reasoning annotations, Infection-Reasoner is trained using a two-stage pipeline: (1) reasoning distillation, in which GPT-5.1 generates chain-of-thought rationales for unlabeled wound images to initialize wound-specific reasoning in a smaller student model (Qwen3-VL-4B-Thinking), and (2) reinforcement learning post-training with Group Relative Policy Optimization on a small labeled infection dataset to refine classification reasoning. On a held-out heterogeneous wound dataset, Infection-Reasoner achieved 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Rationale quality was further evaluated using both multimodal large language model (MLLM) judges and wound expert review. Across four MLLM judges, visual-support agreement scores ranged from 0.722 to 0.903, while expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Infection-Reasoner, a 4B-parameter vision-language model for chronic wound infection classification from photographs together with evidence-grounded rationale generation. Training proceeds in two stages: (1) reasoning distillation in which GPT-5.1 produces chain-of-thought rationales on unlabeled wound images to initialize a Qwen3-VL-4B-Thinking student, and (2) Group Relative Policy Optimization (GRPO) reinforcement learning on a small labeled infection dataset. On a held-out heterogeneous wound dataset the model reports 86.8% accuracy, 86.4% sensitivity and 87.1% specificity, outperforming several baselines including GPT-5.1. Rationale quality is assessed by four MLLM judges (visual-support agreement 0.722–0.903) and by wound-expert review (61.8% rated Correct, 32.4% Partially Correct).

Significance. If the empirical claims hold after the requested clarifications, the work supplies a compact, interpretable model that directly addresses the interpretability gap in prior image-only deep-learning wound classifiers. The distillation-plus-RL recipe for injecting clinical reasoning under limited labeled data is technically interesting and the reported outperformance of GPT-5.1 is a notable result. The combination of MLLM and expert rationale evaluation is a positive step toward grounded assessment.

major comments (2)
  1. [Abstract] Abstract: the reported performance figures (86.8% accuracy, 86.4% sensitivity, 87.1% specificity) and the claim of outperforming GPT-5.1 are presented without dataset size, diversity statistics, statistical significance tests, exact baseline implementations, or ablation results. These omissions are load-bearing for any claim that the two-stage pipeline yields reliable clinical reasoning.
  2. [Abstract] Abstract (two-stage pipeline description): no quantitative validation (expert agreement, error rate, or inter-rater reliability) is supplied for the GPT-5.1-generated chain-of-thought rationales on the unlabeled distillation images. Because the final expert review (61.8% Correct) occurs only after GRPO and does not isolate the distillation contribution, it is impossible to determine whether the reported rationale quality reflects genuine clinical reasoning or inherited teacher artifacts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment below and have revised the manuscript to improve clarity and transparency regarding our experimental details and limitations.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the reported performance figures (86.8% accuracy, 86.4% sensitivity, 87.1% specificity) and the claim of outperforming GPT-5.1 are presented without dataset size, diversity statistics, statistical significance tests, exact baseline implementations, or ablation results. These omissions are load-bearing for any claim that the two-stage pipeline yields reliable clinical reasoning.

    Authors: We agree that the abstract, as a concise summary, omits supporting details that are important for evaluating the claims. The full manuscript provides the held-out dataset description (size, diversity across etiologies, anatomical locations, and imaging conditions) in the experimental setup, reports statistical significance tests for outperformance over baselines including GPT-5.1 in the results, details exact baseline implementations and hyperparameters in the methods, and presents ablation studies on the two-stage pipeline. To address the concern, we have revised the abstract to include a brief reference to dataset scale and statistical significance while maintaining length constraints. revision: partial

  2. Referee: [Abstract] Abstract (two-stage pipeline description): no quantitative validation (expert agreement, error rate, or inter-rater reliability) is supplied for the GPT-5.1-generated chain-of-thought rationales on the unlabeled distillation images. Because the final expert review (61.8% Correct) occurs only after GRPO and does not isolate the distillation contribution, it is impossible to determine whether the reported rationale quality reflects genuine clinical reasoning or inherited teacher artifacts.

    Authors: We acknowledge the value of isolating the distillation stage's contribution through separate quantitative validation of the GPT-5.1 rationales. This was not performed due to the substantial expert annotation costs involved. The expert review was conducted on the final post-GRPO outputs to evaluate the end-to-end system, while MLLM judge scores offer additional supporting evidence for rationale quality. In the revised manuscript, we have added a dedicated limitations paragraph in the discussion that explicitly notes this gap, explains how GRPO refines initial reasoning, and outlines plans for future work on granular distillation ablations. revision: yes

Circularity Check

0 steps flagged

No circularity: results from empirical training on external data with held-out evaluation

Full rationale

The paper presents a standard two-stage ML pipeline (reasoning distillation from GPT-5.1 on unlabeled wound images, followed by GRPO RL refinement on a small labeled set) and reports accuracy/sensitivity/specificity on a held-out heterogeneous dataset plus independent rationale quality checks by MLLM judges and wound experts. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. All performance numbers derive from external data splits and external judges rather than reducing to the training inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance and rationale-quality claims rest on the quality of synthetic reasoning data from GPT-5.1 and the representativeness of the small labeled wound dataset; neither is independently verified beyond the reported aggregate metrics.

free parameters (1)
  • GRPO hyperparameters
    Group Relative Policy Optimization settings are chosen during post-training but not quantified in the abstract.
axioms (1)
  • domain assumption: GPT-5.1 chain-of-thought rationales on unlabeled wound images constitute high-quality supervision for clinical reasoning
    This premise underpins the entire first-stage distillation.

pith-pipeline@v0.9.0 · 5596 in / 1317 out tokens · 60817 ms · 2026-05-10T03:11:11.828887+00:00 · methodology


    Model generation text: Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning• 29 {GENERATION_TEXT} Return STRICT JSON with this exact schema: { "parsed_think": "string", "rubric": { "purulence_pus": { "TEXT_CLAIM":"POS|NEG|UNC|NOT_MENTIONED", "text_span":"string", "IMAGE_EVIDENCE":...