Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation

Bang Yang; Hongxiang Li; Xuxin Cheng; Yaowei Li; Yuexian Zou; Zhihong Zhu

arxiv: 2303.15932 · v5 · pith:24ZUM6VJnew · submitted 2023-03-28 · 💻 cs.CV

Yaowei Li, Bang Yang, Xuxin Cheng, Zhihong Zhu, Hongxiang Li, Yuexian Zou

classification 💻 cs.CV

keywords: alignments, cross-modal, report, then, align, alignment, first, generation

Automatic radiology report generation has attracted enormous research interest due to its practical value in reducing the workload of radiologists. However, simultaneously establishing global correspondences between the image (e.g., a chest X-ray) and its related report, and local alignments between image patches and keywords, remains challenging. To this end, we propose a Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments and introduce three novel modules: Latent Space Unifier (LSU), Cross-modal Representation Aligner (CRA) and Text-to-Image Refiner (TIR). Specifically, LSU unifies multimodal data into discrete tokens, which makes it flexible to learn common knowledge among modalities with a shared network. The modality-agnostic CRA first learns discriminative features via a set of orthonormal basis vectors and a dual-gate mechanism, and then globally aligns visual and textual representations under a triplet contrastive loss. TIR boosts token-level local alignment by calibrating text-to-image attention with a learnable mask. Additionally, we design a two-stage training procedure so that UAR gradually grasps cross-modal alignments at different levels, which imitates radiologists' workflow: writing sentence by sentence first and then checking word by word. Extensive experiments and analyses on the IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of UAR over various state-of-the-art methods.
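
To make the global alignment step more concrete, here is a minimal sketch of a triplet-style contrastive loss over paired image and report features. The abstract does not give the loss formula, so everything here is an assumption for illustration: the cosine similarity, the hinge form, the use of the hardest in-batch negative, the margin of 0.2, and the function names. It is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(image_feats, text_feats, margin=0.2):
    """Mean hinge loss over a batch of paired (image, report) features.

    image_feats[i] and text_feats[i] are assumed to describe the same study.
    For each pair, the matched similarity should exceed the hardest
    mismatched similarity in the batch by at least `margin`, in both the
    image-to-text and text-to-image directions. Needs a batch of size >= 2.
    """
    n = len(image_feats)
    sim = [[cosine(img, txt) for txt in text_feats] for img in image_feats]
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negatives: the most similar wrong report / wrong image.
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin - pos + hardest_text)
        total += max(0.0, margin - pos + hardest_image)
    return total / n
```

With this formulation the loss is zero once every matched pair is more similar than any mismatched pair by the margin, and it grows as mismatched pairs become more similar than matched ones.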

This paper has not been read by Pith yet.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLaMA-XR: A Novel Framework for Radiology Report Generation using LLaMA and QLoRA Fine Tuning
eess.IV · 2025-05 · unverdicted · novelty 3.0

LLaMA-XR fine-tunes LLaMA 3.1 with QLoRA on DenseNet-121 embeddings to generate radiology reports from chest X-rays, reporting ROUGE-L of 0.433 and METEOR of 0.336 on the IU X-ray benchmark.