ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

Hao Liu; Jile Jiao; Lifeng Chen; Tao Sun; Tianqi You; Xiaofeng Mou; Xiao Han; Xiaojie Jin; Yi Xu; Zhicai Ou

ECHO generates chest X-ray reports via one-step-per-block diffusion while improving semantic scores over autoregressive baselines and delivering up to 8 times faster inference.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-21 08:47 UTC pith:BGHEEHJA

Load-bearing objection: ECHO shows a practical fix for one-step block diffusion in CXR reports that delivers real speedups while keeping the claims internally consistent.

arXiv 2604.09450 v2 · pith:BGHEEHJA · submitted 2026-04-10 · cs.LG, cs.AI, eess.IV

Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, Yi Xu

classification: cs.LG, cs.AI, eess.IV

keywords: chest x-ray report generation, diffusion models, vision-language models, one-step inference, efficient generation, medical imaging, conditional distillation

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ECHO as a diffusion-based vision-language model that produces chest X-ray reports through parallel token generation instead of sequential decoding. It tackles the coherence loss that normally appears when compressing diffusion denoising into a single step by introducing a Direct Conditional Distillation process that supplies unfactorized training signals drawn from on-policy trajectories. A Response-Asymmetric Diffusion strategy is added to keep training efficient. Experiments report large gains on RaTE and SemScore metrics together with substantial latency reduction and little loss in clinical accuracy. A reader would care because faster report generation could meaningfully reduce the time radiologists spend on routine documentation.

Core claim

ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, a Response-Asymmetric Diffusion (RAD) training strategy further improves training efficiency while maintaining model effectiveness.
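
The mean-field limitation named in the core claim can be illustrated with a toy example. The sketch below is not the paper's method or data: the two-token vocabulary and the probabilities are hypothetical, chosen only to show how a token-factorized (product-of-marginals) approximation puts probability on token pairs that the joint distribution never produces.

```python
from collections import defaultdict
from itertools import product

# Hypothetical joint distribution over two-token phrases (illustrative only).
# The joint only ever produces these two pairs.
joint = {
    ("pleural", "effusion"): 0.5,
    ("cardiac", "silhouette"): 0.5,
}


def marginals(joint):
    """Per-position marginal distributions of a two-token joint."""
    first, second = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        first[a] += p
        second[b] += p
    return dict(first), dict(second)


def mean_field(joint):
    """Token-factorized approximation: the product of per-position marginals."""
    first, second = marginals(joint)
    return {(a, b): first[a] * second[b] for a, b in product(first, second)}


def off_support_mass(joint, approx):
    """Probability the approximation assigns to pairs the joint never produces."""
    return sum(p for pair, p in approx.items() if joint.get(pair, 0.0) == 0.0)


factorized = mean_field(joint)
leak = off_support_mass(joint, factorized)
print(f"mass on pairs outside the joint's support: {leak:.2f}")
```

In this toy case the marginals are matched exactly, yet half of the factorized mass lands on mixed pairs such as ("pleural", "silhouette"). Sampling every position independently in a single step would hit such pairs, which is the kind of incoherence that unfactorized supervision is meant to counter.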

What carries the argument

Direct Conditional Distillation (DCD) framework that supplies unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies and overcome mean-field bias in single-step generation

Load-bearing premise

Unfactorized supervision drawn from on-policy diffusion trajectories is sufficient to encode the joint dependencies among report tokens and prevent coherence collapse under one-step inference.

What would settle it

On a held-out chest X-ray test set, ECHO reports exhibit measurably lower clinical accuracy or higher factual error rates than the best autoregressive baseline when both are evaluated by the same radiologist panel.

If this is right

Parallel one-step-per-block decoding replaces sequential token-by-token generation, directly lowering wall-clock inference time.
Semantic metrics RaTE and SemScore rise by more than 60 percent relative to prior autoregressive models.
Clinical accuracy metrics remain nearly unchanged, indicating that speed gains do not trade off diagnostic utility.
Response-Asymmetric Diffusion reduces the number of training steps needed without harming final report quality.
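
To make the first consequence concrete, here is a minimal sketch that counts sequential forward passes under each decoding scheme. The report length and block size are hypothetical and are not taken from the paper; the functions are illustrative helpers, not part of any ECHO code.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """Token-by-token decoding: one forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, block_size: int,
                           steps_per_block: int = 1) -> int:
    """Block-wise decoding: each block of tokens is filled in parallel,
    using `steps_per_block` denoising passes before moving to the next block."""
    if block_size <= 0 or steps_per_block <= 0:
        raise ValueError("block_size and steps_per_block must be positive")
    return math.ceil(num_tokens / block_size) * steps_per_block


# Hypothetical setting: a 128-token report decoded in blocks of 4 tokens.
report_length, block = 128, 4
ar = autoregressive_passes(report_length)
one_step = block_diffusion_passes(report_length, block, steps_per_block=1)
multi_step = block_diffusion_passes(report_length, block, steps_per_block=4)
print(ar, one_step, multi_step)
```

The pass count is only a proxy. Measured wall-clock speedup also depends on per-pass cost, caching, and hardware, so the ratio of pass counts should not be read as the paper's reported speedup.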

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation approach could be tested on other radiology modalities such as CT or MRI report generation.
If the unfactorized supervision pattern generalizes, single-step diffusion might become viable for broader medical text synthesis tasks.
Integration into hospital workflows could be explored by measuring end-to-end time savings when radiologists review and edit ECHO drafts instead of writing from scratch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

ECHO shows a practical fix for one-step block diffusion in CXR reports that delivers real speedups while keeping the claims internally consistent.

Hi colleague,

The punchline on this one is that ECHO shows how to compress diffusion-based report generation down to one step per block for chest X-rays, using a distillation approach that keeps the token relationships intact.

What they do well is lay out the Direct Conditional Distillation framework clearly. It builds supervision from actual diffusion trajectories instead of factorized approximations, which directly tackles the coherence problem in one-step sampling. The Response-Asymmetric Diffusion part also looks like a practical tweak to balance training. The abstract and methods give concrete numbers on the 8x speedup and the big lifts in RaTE and SemScore, and the stress-test confirms the construction holds together without internal issues.

The soft spots are mostly around the scale of the claims. Those 64% and 60% improvements are eye-catching, so the paper needs to demonstrate that the autoregressive baselines are the current best and that the evaluation avoids common pitfalls like data leakage or cherry-picked metrics. A bit more on error modes or human evaluation would strengthen the clinical accuracy claim.

Readers working on medical vision-language models or anyone trying to make diffusion models faster for text output will find this useful. It is not a theoretical breakthrough but a solid applied advance. This deserves a serious referee because the technical details are there to check and the results are falsifiable. I would recommend putting it through peer review.

Referee Report

0 major / 3 minor

Summary. The paper proposes ECHO, a diffusion-based vision-language model for chest X-ray report generation. It introduces a Direct Conditional Distillation (DCD) framework that constructs unfactorized supervision from on-policy diffusion trajectories to enable stable one-step-per-block inference, along with a Response-Asymmetric Diffusion (RAD) training strategy. Experiments claim that ECHO outperforms state-of-the-art autoregressive methods by 64.33% on RaTE and 60.58% on SemScore while delivering up to 8× inference speedup with negligible degradation in clinical accuracy.
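
The percentages in the summary are relative improvements, which are easy to confuse with absolute point gains. The sketch below shows the arithmetic under an assumed baseline score of 0.40; that baseline is hypothetical and is not a value reported in the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percent change of `new` over `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


def apply_relative_improvement(baseline: float, percent: float) -> float:
    """Score implied by a relative improvement of `percent` over `baseline`."""
    return baseline * (1.0 + percent / 100.0)


# Hypothetical baseline score; only the 64.33% figure comes from the summary.
assumed_baseline = 0.40
implied_score = apply_relative_improvement(assumed_baseline, 64.33)
print(f"implied score: {implied_score:.4f}")
print(f"absolute gain: {implied_score - assumed_baseline:.4f}")
print(f"relative gain: {relative_improvement(assumed_baseline, implied_score):.2f}%")
```

The same relative gain corresponds to very different absolute gains depending on the baseline, which is one reason the baseline scores and their sources matter when reading the headline numbers.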

Significance. If the reported gains and speedup are reproducible, the work could meaningfully advance efficient medical report generation by addressing the sequential decoding bottleneck of autoregressive VLMs through a one-step block diffusion approach. The on-policy trajectory supervision for mitigating mean-field bias offers a concrete technical contribution that may generalize to other parallel text generation settings.

minor comments (3)

[§3.3] §3.3 and §4.1: The implementation details for the block-wise inference schedule and the exact form of the RAD loss should be expanded with pseudocode or a small worked example to facilitate exact reproduction.
[Table 1] Table 1: All compared baselines must be accompanied by their original citations and the precise dataset splits (e.g., MIMIC-CXR train/val/test ratios) used for each metric.
[§5.2] §5.2: Include statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) for the reported RaTE and SemScore improvements to strengthen the empirical claims.
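
The third comment asks for significance testing. One common option is a paired percentile bootstrap over per-report score differences, sketched below with the Python standard library only. The scores are made-up placeholders, not results from the paper, and `paired_bootstrap_ci` is a hypothetical helper name.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(b - a) on paired scores.

    Returns (point_estimate, lower, upper).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lower, upper


# Placeholder per-report metric scores for two systems on the same test cases.
baseline_scores = [0.40, 0.35, 0.50, 0.45, 0.38, 0.42]
candidate_scores = [0.55, 0.50, 0.58, 0.60, 0.49, 0.61]

point, lower, upper = paired_bootstrap_ci(baseline_scores, candidate_scores)
print(f"mean difference {point:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero supports a real difference on the sampled cases. Pairing by test case matters because both systems are scored on the same images, so per-case difficulty cancels out of the differences.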

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the technical contributions in Direct Conditional Distillation and Response-Asymmetric Diffusion, and recommendation for minor revision. We appreciate the assessment that the work could advance efficient medical report generation if the gains are reproducible. No specific major comments were raised in the report, so we have no individual points requiring direct rebuttal or revision at this stage. We remain available to incorporate any additional suggestions from the editor.

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent validation

The paper introduces architectural innovations (Direct Conditional Distillation and Response-Asymmetric Diffusion) whose performance is validated through direct empirical comparison against autoregressive baselines on standard metrics (RaTE, SemScore) and inference latency. No load-bearing derivation reduces by construction to fitted inputs, self-citations, or renamed known results; the central claims are falsifiable via the reported experiments and do not rely on internal redefinitions or uniqueness theorems imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Review is limited to the abstract; no explicit free parameters, background axioms, or invented physical entities are described. The two named techniques (DCD and RAD) function as methodological contributions rather than new entities with independent evidence.

invented entities (2)

Direct Conditional Distillation (DCD) framework · no independent evidence
purpose: Mitigate mean-field bias in one-step diffusion by using unfactorized supervision from on-policy trajectories
Presented as the core novel component enabling stable one-step inference.

Response-Asymmetric Diffusion (RAD) training strategy · no independent evidence
purpose: Improve training efficiency while maintaining effectiveness
Introduced as an additional training technique.

pith-pipeline@v0.9.0 · 5797 in / 1166 out tokens · 50263 ms · 2026-05-21T08:47:18.317553+00:00 · methodology

Original abstract

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving up to 8× inference speedup with negligible degradation in clinical accuracy.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation
cs.AI · 2026-05 · unverdicted · novelty 7.0
AnchorDiff is a topology-aware masked diffusion framework with RadGraph anchors and confidence-based rewriting that claims state-of-the-art results on MIMIC-CXR and MIMIC-RG4 for radiology report generation.

Discrete Diffusion Language Models for Interactive Radiology Report Drafting
cs.AI · 2026-07 · unverdicted · novelty 6.0
Diffusion LM matches AR performance on medical VQA, runs 3.5-4.4x faster, and enables bidirectional infilling for interactive radiology report drafting.

Temporal and Cross-Modal Alignment for Enhanced Audiovisual Video Captioning
cs.CV · 2026-07 · unverdicted · novelty 4.0
TCA-Captioner introduces an Observer-Checker-Corrector refinement loop and TCA-Bench to address modality detachment and temporal incoherence in audiovisual video captioning.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 3 Pith papers · 14 internal anchors

[1]

Block diffusion: Interpolating between autoregressive and diffusion language models

Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[2]

Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 2021

Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces.Advances in neural information processing systems, 2021

work page 2021
[3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, et al. Llada2. 0: Scaling up diffusion language models to 100b.arXiv preprint arXiv:2512.15745, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Thomas, Jerome Ku, Federico Berto, Jae Myung Kim, Garyk Brixi, Eric Nguyen, Stefano Massaroli, and Michael Poli

Keshigeyan Chandrasegaran, Armin W. Thomas, Jerome Ku, Federico Berto, Jae Myung Kim, Garyk Brixi, Eric Nguyen, Stefano Massaroli, and Michael Poli. Rnd1: Simple, scalable ar-to-diffusion conversion. 2025

work page 2025
[6]

Towards injecting medical visual knowledge into multimodal llms at scale

Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, et al. Towards injecting medical visual knowledge into multimodal llms at scale. InProceedings of the 2024 conference on empirical methods in natural language processing, 2024

work page 2024
[7]

A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation

Zhihong Chen, Maya Varma, Justin Xu, Magdalini Paschali, Dave Van Veen, Andrew Johnston, Alaa Youssef, Louis Blankemeier, Christian Bluethgen, Stephan Altmayer, et al. A vision-language foundation model to enhance efficiency of chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

work page Pith review arXiv 2024
[8]

arXiv preprint arXiv:2509.26488 , year=

Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, and Xinchao Wang. dparallel: Learnable parallel decoding for dllms.arXiv preprint arXiv:2509.26488, 2025

work page arXiv 2025
[9]

Sdar-vl: Stable and efficient block-wise diffusion for vision-language understanding.arXiv preprint arXiv:2512.14068, 2025

Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, and Bowen Zhou. Sdar-vl: Stable and efficient block-wise diffusion for vision-language understanding.arXiv preprint arXiv:2512.14068, 2025

work page arXiv 2025
[10]

Speculative diffusion decoding: Accelerating language generation through diffusion

Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, and Ferdinando Fioretto. Speculative diffusion decoding: Accelerating language generation through diffusion. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volum...

work page 2025
[11]

Preparing a collection of radiology examinations for distribution and retrieval.Journal of the American Medical Informatics Association, 2016

Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Sonya E Shooshan, Laritza Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. Preparing a collection of radiology examinations for distribution and retrieval.Journal of the American Medical Informatics Association, 2016

work page 2016
[12]

Beyond autoregression: Fast llms via self-distillation through time, 2025

Justin Deschenaux and Caglar Gulcehre. Beyond autoregression: Fast llms via self-distillation through time, 2025

work page 2025
[13]

arXiv preprint arXiv:2508.01617

Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Peijie Qiu, Shao Tang, Xin Li, and Yalin Wang. Llada-medv: Exploring large language diffusion models for biomedical image understanding.arXiv preprint arXiv:2508.01617, 2025

work page arXiv 2025
[14]

Unifying autoregressive and diffusion-based sequence generation

Nima Fathi, Torsten Scholak, and Pierre-Andre Noel. Unifying autoregressive and diffusion-based sequence generation. InICLR 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy, 2025

work page 2025
[15]

Scaling diffusion language models via adaptation from autoregressive models

Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[16]

Gemini 3 pro model card, 2025

Google DeepMind. Gemini 3 pro model card, 2025

work page 2025
[17]

Ssd-lm: Semi-autoregressive simplex-based diffusion lan- guagemodelfortextgenerationandmodularcontrol

Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. Ssd-lm: Semi-autoregressive simplex-based diffusion lan- guagemodelfortextgenerationandmodularcontrol. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

work page 2023
[18]

Accelerating diffusion language model inference via efficient kv caching and guided diffusion.arXiv e-prints, 2025

Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S Abdelfattah, Jae-sun Seo, Zhiru Zhang, and Udit Gupta. Accelerating diffusion language model inference via efficient kv caching and guided diffusion.arXiv e-prints, 2025. 13

work page 2025
[19]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Be- hzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. InProceedings of the AAAI conference on artificial intelligence, 2019

work page 2019
[21]

arXiv preprint arXiv:2510.08668 (2025)

Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, et al. Hulu-med: A transparent generalist model towards holistic medical vision-language understanding.arXiv preprint arXiv:2510.08668, 2025

work page arXiv 2025
[22]

Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 2019

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih- ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 2019

work page 2019
[23]

Cxr-llava: a multimodal large language model for interpreting chest x-ray images.European Radiology, 2025

Seowoo Lee, Jiwon Youn, Hyungjin Kim, Mansu Kim, and Soon Ho Yoon. Cxr-llava: a multimodal large language model for interpreting chest x-ray images.European Radiology, 2025

work page 2025
[24]

Llava-onevision: Easy visual task transfer.Transactions on Machine Learning Research, 2024

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.Transactions on Machine Learning Research, 2024

work page 2024
[25]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 2023

work page 2023
[26]

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next- interleave: Tackling multi-image, video, and 3d in large multimodal models.arXiv preprint arXiv:2407.07895, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

DiffuSpec : Unlocking diffusion language models for speculative decoding

Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, and Jun Wang. Diffuspec: Unlocking diffusion language models for speculative decoding.arXiv preprint arXiv:2510.02358, 2025

work page arXiv 2025
[28]

Lavida: A large diffusion language model for multimodal under- standing

Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover. Lavida: A large diffusion language model for multimodal under- standing. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[29]

Cd4lm: Consistency distillation and adaptive decoding for diffusion language models.arXiv preprint arXiv:2601.02236,

Yihao Liang, Ze Wang, Hao Chen, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Emad Barsoum, Zicheng Liu, and Niraj K Jha. Cd4lm: Consistency distillation and adaptive decoding for diffusion language models. arXiv preprint arXiv:2601.02236, 2026

work page arXiv 2026
[30]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, 2004

work page 2004
[31]

Visual instruction tuning.Advances in neural information processing systems, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 2023

work page 2023
[32]

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, and Lin- feng Zhang. dllm-cache: Accelerating diffusion large language models with adaptive caching.arXiv preprint arXiv:2506.06295, 2025

work page internal anchor Pith review arXiv 2025
[33]

Large language diffusion models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, JUN ZHOU, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[34]

Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning

Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, 2025

work page 2025
[35]

RaDialog: A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance

Chantal Pellegrini, Ege Özsoy, Benjamin Busam, Nassir Navab, and Matthias Keicher. Radialog: A large vision- language model for radiology report generation and conversational assistance.arXiv preprint arXiv:2311.18681, 2023

work page Pith review arXiv 2023
[36]

d3llm: Ultra-fast diffusion llm using pseudo-trajectory distillation.arXiv preprint arXiv:2601.07568, 2026

Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, and Hao Zhang. d3llm: Ultra- fast diffusion llm using pseudo-trajectory distillation.arXiv preprint arXiv:2601.07568, 2026. 14

work page arXiv 2026
[37]

Capabilities of Gemini Models in Medicine

Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of gemini models in medicine.arXiv preprint arXiv:2404.18416, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Simpleandeffectivemaskeddiffusionlanguagemodels.Advances in Neural Information Processing Systems, 2024

Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, andVolodymyrKuleshov. Simpleandeffectivemaskeddiffusionlanguagemodels.Advances in Neural Information Processing Systems, 2024

work page 2024
[39]

MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Simplified and generalized masked diffusion for discrete data.Advances in neural information processing systems, 2024

Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data.Advances in neural information processing systems, 2024

work page 2024
[41]

Combining automatic labelers and expert annotations for accurate radiology report labeling using bert

Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y Ng, and Matthew Lungren. Combining automatic labelers and expert annotations for accurate radiology report labeling using bert. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 1500–1519, 2020

work page 2020
[42]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalk- wyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Xraygpt: Chest radiographs summarization using large medical vision-language models

Omkar Chakradhar Thawakar, Abdelrahman M Shaker, Sahal Shaji Mullappilly, Hisham Cholakkal, Rao Muham- mad Anwer, Salman Khan, Jorma Laaksonen, and Fahad Khan. Xraygpt: Chest radiographs summarization using large medical vision-language models. InProceedings of the 23rd workshop on biomedical natural language processing, 2024

work page 2024
[44]

Towards generalist biomedical ai.Nejm Ai, 2024

Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai.Nejm Ai, 2024

work page 2024
[45]

Cider: Consensus-based image description evaluation

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. InProceedings of the IEEE conference on computer vision and pattern recognition, 2015

work page 2015
[46]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

arXiv preprint arXiv:2508.09192 , year=

Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, and Zhijie Deng. Diffusion llms can do faster-than-ar inference via discrete diffusion forcing.arXiv preprint arXiv:2508.09192, 2025

work page arXiv 2025
[48]

Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data.Nature Communications, 2025

Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Hui Hui, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data.Nature Communications, 2025

work page 2025
[49]

FastdLLM v2: Efficient Block Diffusion LLM,

Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, and Enze Xie. Fast-dllm v2: Efficient block-diffusion llm.arXiv preprint arXiv:2509.26328, 2025

work page arXiv 2025
[50]

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding.arXiv preprint arXiv:2505.22618, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Energy-based diffusion language models for text generation

Minkai Xu, Tomas Geffner, Karsten Kreis, Weili Nie, Yilun Xu, Jure Leskovec, Stefano Ermon, and Arash Vahdat. Energy-based diffusion language models for text generation. In The Thirteenth International Conference on Learning Representations, 2025.

[52] Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044, 2025.

[53] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[54] Ling Yang, Zhanyu Wang, Zhenghao Chen, Xinyu Liang, and Luping Zhou. Medxchat: A unified multimodal large language model framework towards CXRs understanding and generation. In 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI), 2025.

[55] Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025.

[56] Jaehoon Yoo, Wonjung Kim, and Seunghoon Hong. Redi: Rectified discrete flow. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[57] Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, and Chongxuan Li. LLaDA-V: Large language diffusion models with visual instruction tuning. arXiv preprint arXiv:2505.16933, 2025.

[58] Runpeng Yu, Xinyin Ma, and Xinchao Wang. Dimple: Discrete diffusion multimodal large language model with parallel decoding. arXiv preprint arXiv:2505.16990, 2025.

[59] Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Yan, Yixin Liu, Jun Yu, Zhengliang Liu, Xun Chen, Brian D Davison, Hui Ren, et al. A generalist vision–language foundation model for diverse biomedical tasks. Nature Medicine, 2024.

[60] Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Vladimir Pavlovic, et al. T3d: Few-step diffusion language models via trajectory self-distillation with direct discriminative optimization. arXiv preprint arXiv:2602.12262, 2026.

[61] Xiaoman Zhang, Julián N Acosta, Josh Miller, Ouwen Huang, and Pranav Rajpurkar. ReXGradient-160K: A large-scale publicly available dataset of chest radiographs with free-text reports. arXiv preprint arXiv:2505.00228, 2025.

[62] Yichi Zhang, Alex Schwing, and Zhizhen Zhao. Variational masked diffusion models. arXiv preprint arXiv:2510.23606, 2025.

[63] Weike Zhao, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. RaTEScore: A metric for radiology report generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.

Appendix

This supplementary material provides additional details and results to complement the main paper, organized as follows....

[64] Incomplete "Findings" sections, often omitting descriptions of normal (negative) findings

[65] Inclusion of interpretive or inferential statements within the "Findings" section that should belong in the "Impression" section.

Your input will be an original report containing both "Findings" and "Impression" sections. Your output must be a standardized report in JSON format, without any additional explanations or comments. Standardization requirements:

[66] Findings:
- Structure the findings in the following anatomical order: thorax, mediastinum and trachea, lung fields, cardiac silhouette, hila, diaphragm and costophrenic angles, and bony structures.
- Retain all organ/structure descriptions present in the original report.
- Retain any additional relevant details (e.g., presence of tubes or lines).
- Retain...

[67] Impression:
- Retain the original impression.
- You may add diagnostic conclusions or clinical recommendations based on the standardized findings and original impression.

Below are examples of standardized negative reports for reference: Standardized Negative Report (1): Findings: The thorax is symmetric bilaterally. The mediastinum and trachea are midlin...

[68] Extract and summarize the objective imaging observations into the FINDINGS section.
- FINDINGS must contain only descriptive radiological observations.
- Do NOT include any diagnostic conclusions in FINDINGS

[69] Extract and summarize the diagnostic interpretation into the IMPRESSION section.
- IMPRESSION should reflect the clinician’s overall diagnostic assessment.
- Do NOT introduce any new diagnoses that are not explicitly stated in the original report

[70] Translate the content accurately into English using standard radiology terminology.
- Avoid literal word-by-word translation.
- Use clinically accepted expressions (e.g., “increased bronchovascular markings” instead of “lung texture thickened”)

[71] Preserve all expressions of uncertainty (e.g., “suggestive of”, “cannot exclude”, “likely”, “consider”).
- Do NOT convert uncertain statements into definitive conclusions

[72] If the original Chinese report contains only FINDINGS or only IMPRESSION, do NOT fabricate the missing section.
- Leave the missing section empty if necessary

[73] Standardized Output Format (strict): FINDINGS: <content> IMPRESSION: <content>

[74] Wrap your final output strictly within: ```output <your standardized report> ```

[75] Output ONLY the standardized English report.
- Do NOT include any explanation, notes, or additional commentary.

Here is the Chinese medical report to be processed: ```input {content} ```
- If the original content is ambiguous, incomplete, or poorly structured, you must translate it faithfully without attempting to correct or improve it.

here is the output...
