CXRMate-2: Structured Multimodal Temporal Embeddings and Tractable Reinforcement Learning for Clinically Acceptable Chest X-ray Radiology Report Generation
Pith reviewed 2026-05-10 03:36 UTC · model grok-4.3
The pith
Chest X-ray report model reaches 45% acceptability in blinded radiologist ratings, with no statistically significant preference difference on seven of eight findings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CXRMate-2 achieves statistically significant gains over prior models on GREEN and RadGraph-XL across MIMIC-CXR, CheXpert Plus, and ReXgradient. In a randomised, blinded review by three consultant radiologists, its reports on 120 studies were rated acceptable (preferred or rated equal to the radiologist's report) in 45% of ratings, with no statistically significant preference difference for seven of the eight analysed findings. The model enables this outcome through structured multimodal temporal embeddings that make GRPO reinforcement learning tractable, paired with a reward function aimed at semantic alignment with radiologist text. The authors conclude that higher recall of subtle findings remains the main gap.
What carries the argument
Structured multimodal temporal embeddings with high-resolution visual feature compression that together allow efficient unified conditioning of an LLM decoder on visual, textual, and temporal context from a study and its prior, enabling tractable GRPO reinforcement learning.
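The conditioning scheme described above can be sketched in miniature. Everything here is an illustrative assumption, not the authors' implementation: the compression is a toy average-pool, the dimensions are made up, and `source_emb`/`time_emb` stand in for whatever learned embeddings the paper actually uses. The point is only the mechanism: compress high-resolution visual features, tag current-study and prior-study tokens with modality and temporal embeddings, and concatenate everything into one prefix for the LLM decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # hypothetical embedding width

def compress_visual(feats: np.ndarray, n_tokens: int = 8) -> np.ndarray:
    """Toy stand-in for high-resolution visual feature compression:
    average-pool groups of patch features down to n_tokens tokens."""
    groups = np.array_split(feats, n_tokens, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])

# Stand-ins for learned embeddings marking modality and study (temporal) origin.
source_emb = {"image": rng.normal(size=D), "text": rng.normal(size=D)}
time_emb = {"current": rng.normal(size=D), "prior": rng.normal(size=D)}

def build_prefix(cur_img, prior_img, prior_report_tok):
    """Unified conditioning prefix: each token is tagged with its modality
    and temporal embedding, then all parts are concatenated so the decoder
    sees visual, textual, and temporal context in one sequence."""
    parts = [
        compress_visual(cur_img) + source_emb["image"] + time_emb["current"],
        compress_visual(prior_img) + source_emb["image"] + time_emb["prior"],
        prior_report_tok + source_emb["text"] + time_emb["prior"],
    ]
    return np.concatenate(parts, axis=0)

cur = rng.normal(size=(196, D))     # current-study patch features
prior = rng.normal(size=(196, D))   # prior-study patch features
report = rng.normal(size=(32, D))   # embedded prior-report tokens
prefix = build_prefix(cur, prior, report)
print(prefix.shape)  # (8 + 8 + 32, 64) = (48, 64)
```

Compressing each image to a short, fixed-length token budget is what keeps the decoder's sequence length, and hence RL sampling cost, small enough for GRPO to be tractable.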
If this is right
- Statistically significant metric gains on GREEN (11.2%) and RadGraph-XL (24.4%) over MedGemma 1.5 (4B) on MIMIC-CXR.
- Generated reports are consistently preferred for readability while radiologist reports are preferred for recall.
- The remaining barrier to non-inferiority is detection of subtle findings.
- The approach positions CXR RRG for prospective evaluation inside assistive, radiologist-led workflows.
Where Pith is reading between the lines
- If recall of subtle findings can be improved without losing readability, the model could move from 45% acceptability toward routine assistive use in high-volume reporting.
- The same temporal-embedding structure might transfer to other longitudinal imaging tasks such as CT follow-up where prior context matters.
- Real-world deployment would require monitoring whether radiologists adjust their own reporting style when assisted by the model.
Load-bearing premise
That the blinded qualitative ratings by three consultant radiologists on 120 MIMIC-CXR studies generalise to wider clinical practice, and that the GRPO reward function improves true semantic alignment without hidden post-hoc adjustments.
What would settle it
A follow-up study with at least 500 diverse cases and more than three radiologists that finds acceptability below 30% or statistically significant preference gaps on four or more findings would refute the claim of a clear pathway to clinical acceptability.
Original abstract
Chest X-ray (CXR) radiology report generation (RRG) models have shown rapid progress on automated metrics, yet their clinical utility remains uncertain due to limited qualitative evaluation by radiologists. We present CXRMate-2, a state-of-the-art CXR RRG model that enables tractable reinforcement learning (RL) through structured multimodal temporal embeddings and high-resolution visual feature compression, for efficient, unified conditioning of an LLM decoder on visual, textual, and temporal context from a study and its prior. This enables group relative policy optimisation (GRPO), where a proposed reward function is used to improve semantic alignment with radiologist reports. Across the MIMIC-CXR, CheXpert Plus, and ReXgradient datasets, CXRMate-2 achieves statistically significant improvements over strong benchmarks, including gains of 11.2% and 24.4% in GREEN and RadGraph-XL, respectively, on MIMIC-CXR relative to MedGemma 1.5 (4B). To directly compare CXRMate-2 against radiologist reporting, we conduct a blinded, randomised qualitative retrospective evaluation. Three consultant radiologists compare generated and radiologist reports across 120 studies from the MIMIC-CXR test set. Generated reports were deemed acceptable (defined as preferred or rated equally to radiologist reports) in 45% of ratings, with no statistically significant difference in preference rates for seven of the eight analysed findings. Preferences for radiologist reports were driven primarily by higher recall, while generated reports were consistently preferred for readability. Together, these results define a clear pathway to clinically acceptable CXR RRG. Improving recall and the detection of subtle findings represents the primary remaining barrier to non-inferiority with radiologist reporting, positioning CXR RRG for prospective evaluation in assistive, radiologist-led workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CXRMate-2, a chest X-ray radiology report generation model that uses structured multimodal temporal embeddings and high-resolution visual feature compression to enable tractable reinforcement learning via group relative policy optimization (GRPO) with a custom reward function for improved semantic alignment. It reports statistically significant gains on automated metrics including 11.2% in GREEN and 24.4% in RadGraph-XL on MIMIC-CXR relative to MedGemma 1.5 (4B), with similar results on CheXpert Plus and ReXgradient. A blinded randomized qualitative evaluation by three consultant radiologists on 120 MIMIC-CXR studies finds generated reports acceptable (preferred or equal) in 45% of ratings, with no statistically significant preference differences versus radiologist reports for seven of eight findings; radiologist reports are preferred for recall while generated reports score higher on readability. The work concludes that improving recall of subtle findings is the main remaining barrier to non-inferiority.
Significance. If the results hold, the paper is significant for directly linking metric improvements to a blinded radiologist preference study that quantifies clinical acceptability and isolates recall as the primary gap. The technical contributions around temporal embeddings and tractable GRPO provide a concrete mechanism for aligning generated reports with radiologist style, and the identification of readability advantages offers a practical insight for assistive deployment. These elements, combined with multi-dataset evaluation, strengthen the case for progressing CXR report generation toward prospective clinical testing.
major comments (2)
- [Qualitative evaluation (blinded retrospective study on 120 MIMIC-CXR cases)] The interpretation of no statistically significant difference in preference rates for seven of the eight findings as supporting clinical acceptability (45% acceptable rate) is load-bearing for the central claim yet rests on a sample of 120 studies rated by three radiologists without reported power analysis, equivalence testing (e.g., TOST), or confidence intervals on the preference proportions. Non-significance may reflect insufficient sensitivity rather than true parity, especially given the abstract's note that radiologist preference is driven by higher recall of subtle findings.
- [Methods (GRPO and reward function)] The reward function design for GRPO, including its precise formulation, how it avoids embedding data-driven fitting or post-hoc tuning, and its relation to the automated metrics, is insufficiently detailed. This is critical because the semantic alignment improvements and the 45% acceptability claim depend on this component; without explicit equations or ablation on reward components, reproducibility and assessment of selection effects in the 120-study subset are compromised.
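The first comment's sensitivity concern can be made concrete with a confidence interval on the headline proportion. The sketch below uses the Wilson score interval on 45% acceptability over an assumed 360 ratings (120 studies × 3 radiologists; how the paper actually aggregates ratings is not stated, and the interval ignores within-study correlation between raters, so treat it as illustrative only):

```python
from math import sqrt
from statistics import NormalDist

def wilson_ci(successes: int, n: int, conf: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 45% acceptable out of an assumed 360 ratings (120 studies x 3 raters):
lo, hi = wilson_ci(162, 360)
print(f"95% CI: {lo:.3f} - {hi:.3f}")
```

Even under this optimistic independence assumption the interval spans roughly 40–50%, i.e. the data are compatible with appreciably lower acceptability than the point estimate; accounting for rater clustering would widen it further, which is the referee's point in numbers.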
minor comments (2)
- [Experimental setup] Clarify the exact data splits used for training, validation, and the 120-study qualitative subset, including any exclusion criteria or selection effects that might affect generalizability to broader clinical practice.
- [Abstract and results] The abstract states gains relative to MedGemma 1.5 (4B); providing parameter counts and architectural details for all baselines in a table would improve context for the reported metric improvements.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below with point-by-point responses and indicate the revisions planned for the next manuscript version.
Point-by-point responses
-
Referee: [Qualitative evaluation (blinded retrospective study on 120 MIMIC-CXR cases)] The interpretation of no statistically significant difference in preference rates for seven of the eight findings as supporting clinical acceptability (45% acceptable rate) is load-bearing for the central claim yet rests on a sample of 120 studies rated by three radiologists without reported power analysis, equivalence testing (e.g., TOST), or confidence intervals on the preference proportions. Non-significance may reflect insufficient sensitivity rather than true parity, especially given the abstract's note that radiologist preference is driven by higher recall of subtle findings.
Authors: We acknowledge the referee's concern regarding the statistical interpretation of the qualitative results. The 120-study sample follows common practice in radiology AI reader studies, but we agree that confidence intervals and power considerations should be reported explicitly. In the revised manuscript we will add 95% confidence intervals for all preference proportions and include a post-hoc power discussion based on the observed effect sizes. While we did not perform TOST equivalence testing, the consistent pattern (no significant differences on seven of eight findings, with the sole driver of radiologist preference being recall of subtle findings) supports our conclusion that recall remains the primary barrier. We will expand the limitations section to qualify the acceptability claim and note the retrospective, exploratory nature of the evaluation. revision: partial
-
Referee: [Methods (GRPO and reward function)] The reward function design for GRPO, including its precise formulation, how it avoids embedding data-driven fitting or post-hoc tuning, and its relation to the automated metrics, is insufficiently detailed. This is critical because the semantic alignment improvements and the 45% acceptability claim depend on this component; without explicit equations or ablation on reward components, reproducibility and assessment of selection effects in the 120-study subset are compromised.
Authors: We thank the referee for identifying this gap in methodological detail. In the revised manuscript we will provide the complete mathematical formulation of the reward function, including each term (semantic alignment via RadGraph-XL and GREEN components, readability, and length penalties) and the fixed weighting scheme. The design uses clinically motivated, pre-defined metrics rather than any fitting or tuning on the test or qualitative-evaluation sets, thereby avoiding data-driven selection effects. We will also add ablation results isolating the contribution of each reward component to both automated metric gains and the qualitative acceptability rates. These additions will directly support reproducibility and allow readers to assess any influence on the 120-study subset. revision: yes
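The response above names the reward ingredients without equations. A minimal sketch of how a fixed-weight composite reward could feed GRPO's group-relative advantages is given below; the component names, weights, and scores are invented, and the within-group standardisation follows the original GRPO formulation, which may differ from CXRMate-2's variant:

```python
import numpy as np

# Hypothetical fixed weights over pre-defined reward components; the
# paper's actual terms (e.g. GREEN, RadGraph-XL, readability, length)
# and their weights may differ.
WEIGHTS = {"semantic": 0.6, "readability": 0.3, "length_penalty": 0.1}

def composite_reward(scores: dict) -> float:
    """Fixed-weight sum of component scores in [0, 1]; no fitting on
    evaluation data, so no data-driven selection effects."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

def grpo_advantages(rewards, eps: float = 1e-8):
    """Group-relative advantages: rewards for a group of reports sampled
    for the SAME study, standardised within the group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled candidate reports for one study, scored per component:
group = [
    {"semantic": 0.9, "readability": 0.8, "length_penalty": 1.0},
    {"semantic": 0.6, "readability": 0.9, "length_penalty": 1.0},
    {"semantic": 0.4, "readability": 0.7, "length_penalty": 0.5},
    {"semantic": 0.7, "readability": 0.6, "length_penalty": 0.9},
]
adv = grpo_advantages([composite_reward(s) for s in group])
print(adv.round(2))  # positive for above-group-average reports
```

Because advantages are computed relative to the group mean, GRPO needs no learned value model, which is part of what makes RL tractable once sequence lengths are kept short by the embedding compression.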
Circularity Check
No significant circularity; derivation and claims rest on independent evaluation
Full rationale
The paper introduces structured embeddings and GRPO with a proposed reward function to align generated reports with radiologist references, then reports gains on GREEN and RadGraph-XL plus a separate blinded radiologist preference study on 120 MIMIC-CXR cases. No equation or step reduces to its own inputs by construction: the reward drives RL training, but the primary clinical-acceptability claim (45% acceptable, no significant preference difference on seven of eight findings) is measured by external human raters rather than by the automated metrics or the reward itself. Automated metric gains are presented as empirical outcomes, not tautological predictions. Self-citations are absent from the provided text and the evaluation design supplies independent grounding.