pith. machine review for the scientific record.

arxiv: 2604.27559 · v1 · submitted 2026-04-30 · 💻 cs.CV · cs.AI

Recognition: unknown

RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 09:54 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords radiology report generation · hierarchical alignment · optimal transport · chest X-ray · cross-modal alignment · transformer · clinical efficacy · natural language generation

The pith

RIHA generates more accurate radiology reports by aligning image features with report structure at paragraph, sentence, and word levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RIHA, a transformer-based framework for generating radiology reports from medical images that tackles fine-grained alignment between visual features and the structured hierarchies of long clinical texts. Prior methods often flatten reports into simple sequences, which loses the paragraph-to-sentence-to-word organization essential for capturing diagnostic nuances. RIHA counters this with visual and text feature pyramids whose levels are matched via optimal transport, and with relative positional encoding in the decoder for tighter token alignment. On the IU-Xray and MIMIC-CXR chest X-ray benchmarks it records gains over prior models in both language-quality scores and clinical-correctness measures. A reader would care because better automated reports could cut radiologist workload while lowering the chance of missed findings.

Core claim

RIHA is an end-to-end framework that performs multi-level alignment between radiological images and their corresponding reports across paragraph, sentence, and word levels. It introduces a Visual Feature Pyramid to extract multi-scale visual features and a Text Feature Pyramid to represent multi-granularity textual structures, integrated via a Cross-modal Hierarchical Alignment module that uses optimal transport. Relative Positional Encoding is added to the decoder to enhance token-level alignment, leading to superior performance on benchmark datasets.
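For orientation, the entropic optimal-transport problem that alignment modules of this kind typically solve has a standard form (after Cuturi [63] and Villani [67]; the paper's exact cost and marginals may differ):

    \min_{T \in U(a,b)} \langle T, C \rangle - \varepsilon H(T), \qquad U(a,b) = \{\, T \in \mathbb{R}_{+}^{m \times n} : T \mathbf{1}_n = a, \; T^{\top} \mathbf{1}_m = b \,\}

Here C_{ij} is the cost of matching visual feature i to text unit j (for instance, one minus their cosine similarity), a and b are marginal weights over the m visual and n textual units at a given pyramid level, H(T) is the entropy of the transport plan T, and ε trades alignment sharpness against smoothness.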

What carries the argument

The Cross-modal Hierarchical Alignment module, which leverages optimal transport to align multi-scale visual features from the Visual Feature Pyramid with multi-granularity textual structures from the Text Feature Pyramid at paragraph, sentence, and word levels.
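How such a module could compute a per-level alignment is sketched below. This is a minimal illustration, not the paper's implementation: the uniform marginals, cosine cost, and tensor shapes are assumptions, and the Sinkhorn iteration follows the standard entropic-OT recipe of Cuturi [63].

```python
import math
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost: torch.Tensor, eps: float = 0.1, n_iters: int = 50) -> torch.Tensor:
    """Entropic OT plan between uniform marginals, computed in log space.

    cost: (m, n) matching costs between m visual and n textual units at one
    pyramid level. Returns an (m, n) plan whose rows sum to 1/m and whose
    columns sum to 1/n.
    """
    m, n = cost.shape
    log_a = torch.full((m,), -math.log(m))   # log of uniform row marginals
    log_b = torch.full((n,), -math.log(n))   # log of uniform column marginals
    log_K = -cost / eps                      # Gibbs kernel in the log domain
    log_u = torch.zeros(m)
    log_v = torch.zeros(n)
    for _ in range(n_iters):                 # alternating marginal projections
        log_u = log_a - torch.logsumexp(log_K + log_v[None, :], dim=1)
        log_v = log_b - torch.logsumexp(log_K + log_u[:, None], dim=0)
    return (log_u[:, None] + log_K + log_v[None, :]).exp()

def hierarchical_alignment_cost(visual_pyramid, text_pyramid, eps: float = 0.1):
    """Sum OT costs over (paragraph, sentence, word) level pairs.

    Each pyramid is a list of (num_units, dim) tensors, one per granularity,
    assumed already projected into a shared embedding space.
    """
    total = torch.tensor(0.0)
    for v_feats, t_feats in zip(visual_pyramid, text_pyramid):
        v = F.normalize(v_feats, dim=-1)
        t = F.normalize(t_feats, dim=-1)
        cost = 1.0 - v @ t.T                 # cosine distance as matching cost
        plan = sinkhorn_plan(cost, eps)
        total = total + (plan * cost).sum()  # transport cost <T, C> at this level
    return total
```

The per-level plans could also re-weight features before decoding; summing transport costs across levels is one plausible way to turn hierarchical alignment into a single training signal.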

If this is right

  • Outperforms existing state-of-the-art models in natural language generation metrics on IU-Xray and MIMIC-CXR
  • Records higher clinical efficacy metric scores on the same two datasets
  • Captures nuanced semantics in clinical narratives through precise cross-modal mapping
  • Strengthens token-level alignment between visual features and generated text via relative positional encoding

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same multi-level alignment idea could be tested on report generation from CT or MRI volumes where spatial hierarchies are even more pronounced
  • Embedding such a model in clinical workflows might shorten report turnaround times by surfacing structured findings automatically
  • The emphasis on report structure implies that hierarchical alignment techniques may transfer to other image-to-long-text tasks such as pathology slide captioning

Load-bearing premise

That multi-level feature pyramids combined with optimal transport will produce clinically faithful alignments without introducing or omitting critical diagnostic details that flat-sequence models already handle adequately.

What would settle it

A side-by-side review of generated reports on test cases containing subtle or multiple findings, checking whether RIHA omits or fabricates a key abnormality that a non-hierarchical baseline correctly includes or excludes.

Figures

Figures reproduced from arXiv: 2604.27559 by Conghao Xiong, Si Yong Yeo, Xulei Yang, Yang Yu, Yucheng Chen, Yufei Shi.

Figure 1
Figure 1: A chest X-ray image alongside its corresponding radiology report. The boxes on the right display, respectively, the individual sentences and the extracted keywords from the report. Multi-level visual-textual alignments are indicated using matching colours to highlight their associations. view at source ↗
Figure 2
Figure 2: The architecture of RIHA: an image is fed into the VFP Extractor to obtain shallow, middle, and high-level features, while paragraph-, sentence-, and word-level text features are extracted by the TFP Extractor. The multi-granularity visual and textual features are then sent into CHA for hierarchical alignment. After that, refined visual and textual features are fed into a transformer … view at source ↗
Figure 4
Figure 4: An illustration comparing relative and absolute positional embeddings in transformers, where the clipped value k = 3 represents the maximum allowable relative position distance. a) Absolute position embedding weights. b) Relative position embedding weights. c) The transformer encoder structure with relative position embeddings. For further details, see [68]. view at source ↗
Figure 5
Figure 5: Examples of generated reports from the MIMIC-CXR testing subset using the baseline model and the proposed RIHA method. Identical findings in the ground truth (GT) and generated reports are highlighted with matching colors, demonstrating the superior performance of the approach. view at source ↗
Figure 6
Figure 6: Example reports generated by incrementally incorporating the proposed modules into the baseline model. Key medical terms are highlighted in different colors to clearly differentiate model performance. view at source ↗
Figure 7
Figure 7: Attention map visualizations for various keywords from the Baseline and RIHA models reveal that RIHA assigns more precise attention regions, highlighting its improved focus for each keyword. view at source ↗
Original abstract

Radiology report generation (RRG) has emerged as a promising approach to alleviate radiologists' workload and reduce human errors by automatically generating diagnostic reports from medical images. A key challenge in RRG is achieving fine-grained alignment between complex visual features and the hierarchical structure of long-form radiology reports. Although recent methods have improved image-text representation learning, they often treat reports as flat sequences, overlooking their structured sections and semantic hierarchies. This simplification hinders precise cross-modal alignment and weakens RRG accuracy. To address this challenge, we propose RIHA (Report-Image Hierarchical Alignment Transformer), a novel end-to-end framework that performs multi-level alignment between radiological images and their corresponding reports across paragraph, sentence, and word levels. This hierarchical alignment enables more precise cross-modal mapping, essential for capturing the nuanced semantics embedded in clinical narratives. Specifically, RIHA introduces a Visual Feature Pyramid (VFP) to extract multi-scale visual features and a Text Feature Pyramid (TFP) to represent multi-granularity textual structures. These components are integrated through a Cross-modal Hierarchical Alignment (CHA) module, leveraging optimal transport to effectively align visual and textual features across various levels. Furthermore, we incorporate Relative Positional Encoding (RPE) into the decoder to model spatial and semantic relationships among tokens, enhancing the token-level alignment between visual features and generated text. Extensive experiments on two benchmark chest X-ray datasets, IU-Xray and MIMIC-CXR, demonstrate that RIHA outperforms existing state-of-the-art models in both natural language generation and clinical efficacy metrics.
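Because both the abstract and Figure 4 lean on clipped relative positions, here is a minimal sketch of a clipped relative-position attention bias in the spirit of Shaw et al. [68]. The class name and shapes are assumptions for illustration, not the paper's actual decoder:

```python
import torch
import torch.nn as nn

class ClippedRelativeBias(nn.Module):
    """Learned attention bias indexed by clipped relative distance.

    Offsets j - i are clipped to [-k, k] (Figure 4 uses k = 3), so all token
    pairs farther apart than k share a single learned embedding.
    """
    def __init__(self, k: int = 3):
        super().__init__()
        self.k = k
        self.bias = nn.Embedding(2 * k + 1, 1)      # one scalar per clipped offset

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]           # (L, L) offsets j - i
        rel = rel.clamp(-self.k, self.k) + self.k   # shift into [0, 2k] for lookup
        return self.bias(rel).squeeze(-1)           # (L, L) additive bias

# Added to scaled dot-product attention logits before softmax, e.g.:
# scores = q @ k_mat.transpose(-2, -1) / d ** 0.5 + ClippedRelativeBias(k=3)(L)
```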

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RIHA, a transformer-based framework for radiology report generation that extracts multi-scale visual features via a Visual Feature Pyramid (VFP), multi-granularity textual features via a Text Feature Pyramid (TFP), and aligns them across paragraph/sentence/word levels using a Cross-modal Hierarchical Alignment (CHA) module based on optimal transport; it further adds Relative Positional Encoding in the decoder. Experiments on IU-Xray and MIMIC-CXR are reported to show gains over prior SOTA in both NLG metrics and clinical efficacy metrics.

Significance. If the hierarchical alignment demonstrably improves clinical fidelity without introducing omissions, the work would usefully extend representation learning for structured medical text generation. The integration of optimal transport across explicit pyramid levels is a concrete technical step beyond flat-sequence baselines.

major comments (2)
  1. [CHA module and Experiments] The central claim that VFP+TFP+CHA yields clinically faithful alignments superior to flat-sequence models rests on aggregate NLG and clinical metrics; however, the manuscript provides no entity-level error analysis (e.g., per-finding omission rates for lesions, negations, or rare descriptors) to rule out the possibility that global optimal transport improves averages while still dropping sparse high-stakes details that flat models already capture.
  2. [Experiments] The experimental section must report baseline re-implementations, statistical significance tests, error bars across runs, and explicit checks for data leakage between train/test splits on both IU-Xray and MIMIC-CXR; without these, the reported outperformance cannot be taken as load-bearing evidence for the hierarchical design.
minor comments (2)
  1. [Method] Notation for the three pyramid levels (paragraph/sentence/word) and the precise formulation of the optimal-transport cost matrix should be introduced with a single equation block rather than scattered prose.
  2. [Figures] Figure captions for the overall architecture and feature-pyramid diagrams should explicitly label the input/output tensors at each level to aid reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the need for more granular validation of our hierarchical alignment claims and stronger experimental protocols. We address each major comment below and will revise the manuscript to incorporate the requested elements where feasible.

Point-by-point responses
  1. Referee: [CHA module and Experiments] The central claim that VFP+TFP+CHA yields clinically faithful alignments superior to flat-sequence models rests on aggregate NLG and clinical metrics; however, the manuscript provides no entity-level error analysis (e.g., per-finding omission rates for lesions, negations, or rare descriptors) to rule out the possibility that global optimal transport improves averages while still dropping sparse high-stakes details that flat models already capture.

    Authors: We agree that aggregate metrics alone leave open the possibility of overlooking omissions in sparse but clinically critical elements. In the revised manuscript we will add an entity-level error analysis section that reports per-finding omission rates for lesions, negations, and rare descriptors on both IU-Xray and MIMIC-CXR, directly comparing RIHA against the strongest flat-sequence baselines to quantify whether the hierarchical optimal-transport alignment reduces such omissions (a sketch of such a computation follows the responses below). revision: yes

  2. Referee: [Experiments] The experimental section must report baseline re-implementations, statistical significance tests, error bars across runs, and explicit checks for data leakage between train/test splits on both IU-Xray and MIMIC-CXR; without these, the reported outperformance cannot be taken as load-bearing evidence for the hierarchical design.

    Authors: We acknowledge that these experimental details are necessary for the results to serve as load-bearing evidence. The revised experimental section will include: (i) re-implementations of the primary baselines using the same training protocols, (ii) statistical significance tests (paired t-tests with p-values) on the NLG and clinical metrics, (iii) error bars computed over multiple random seeds, and (iv) explicit data-leakage verification confirming that no patient or study overlap exists between train and test splits on either dataset. revision: yes
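To make the first response concrete, a per-finding omission rate can be computed from CheXpert-style labels [9] roughly as follows. This is a minimal sketch with hypothetical inputs, assuming a labeler that maps each report to a set of positive findings; it is not code from the paper.

```python
from collections import Counter

def per_finding_omission_rates(gt_findings, gen_findings):
    """Fraction of reports in which each ground-truth finding is absent
    from the corresponding generated report.

    gt_findings / gen_findings: parallel lists of per-report sets of positive
    finding labels, e.g. from a CheXpert-style labeler.
    """
    present = Counter()  # reports in which the finding appears in ground truth
    omitted = Counter()  # of those, reports where generation drops it
    for gt, gen in zip(gt_findings, gen_findings):
        for finding in gt:
            present[finding] += 1
            if finding not in gen:
                omitted[finding] += 1
    return {f: omitted[f] / present[f] for f in present}

# "pleural effusion" is dropped in 1 of the 2 reports containing it -> rate 0.5
rates = per_finding_omission_rates(
    [{"cardiomegaly", "pleural effusion"}, {"pleural effusion"}],
    [{"cardiomegaly"}, {"pleural effusion"}],
)
```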

Circularity Check

0 steps flagged

No circularity: RIHA's multi-level alignment is an independent architectural proposal evaluated empirically

full rationale

The paper introduces RIHA as a novel end-to-end framework combining Visual Feature Pyramid (VFP), Text Feature Pyramid (TFP), Cross-modal Hierarchical Alignment (CHA) via optimal transport, and Relative Positional Encoding (RPE). Performance claims rest on standard benchmark evaluations (IU-Xray, MIMIC-CXR) for NLG and clinical metrics, with no equations or derivations that reduce predictions to fitted inputs, self-definitions, or self-citation chains. The hierarchical alignment is motivated externally by limitations of flat-sequence models rather than being tautological with its own components. No load-bearing step equates any result to its construction inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on standard deep-learning assumptions plus the domain-specific claim that report hierarchy can be captured by feature pyramids and aligned via optimal transport; no new physical entities are introduced.

free parameters (2)
  • Feature pyramid scales and dimensions
    The number of levels and channel sizes in VFP and TFP are architectural choices tuned on validation data.
  • Optimal transport regularization parameters
    Parameters controlling the transport cost and entropy regularization are learned or selected during training.
axioms (2)
  • domain assumption Optimal transport provides an effective mechanism for cross-modal alignment at multiple granularities
    Invoked in the CHA module description without further justification in the abstract.
  • domain assumption Relative positional encoding improves token-level alignment between visual features and generated text
    Stated as an enhancement in the decoder without derivation.

pith-pipeline@v0.9.0 · 5590 in / 1402 out tokens · 50397 ms · 2026-05-07T09:54:46.644493+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

85 extracted references · 13 canonical work pages

  1. [1] K. Konstantinidis, “The shortage of radiographers: A global crisis in healthcare,” Journal of Medical Imaging and Radiation Sciences, vol. 55, no. 4, p. 101333, 2024.
  2. [2] W. Chen, L. Shen, J. Lin, J. Luo, X. Li, and Y. Yuan, “Fine-grained image-text alignment in medical imaging enables explainable cyclic image-report generation,” arXiv preprint arXiv:2312.08078, 2023.
  3. [3] G. Reale-Nosei, E. Amador-Domínguez, and E. Serrano, “From vision to text: A comprehensive review of natural image captioning in medical diagnosis and radiology report generation,” Medical Image Analysis, p. 103264, 2024.
  4. [4] I. Hartsock and G. Rasool, “Vision-language models for medical report generation and visual question answering: A review,” Frontiers in Artificial Intelligence, vol. 7, p. 1430984, 2024.
  5. [5] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning. PMLR, 2015, pp. 2048–2057.
  6. [6] B. Jing, P. Xie, and E. Xing, “On the automatic generation of medical imaging reports,” arXiv preprint arXiv:1711.08195, 2017.
  7. [7] Z. Li, L. T. Yang, B. Ren, X. Nie, Z. Gao, C. Tan, and S. Z. Li, “MLIP: Enhancing medical visual representation with divergence encoder and knowledge-guided contrastive learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11704–11714.
  8. [8] M. Chatterjee and A. G. Schwing, “Diverse and coherent paragraph generation from images,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 729–744.
  9. [9] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya et al., “CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 590–597.
  10. [10] J. Wang, A. Bhalerao, and Y. He, “Cross-modal prototype driven network for radiology report generation,” in European Conference on Computer Vision. Springer, 2022, pp. 563–579.
  11. [11] Z. Chen, Y. Song, T.-H. Chang, and X. Wan, “Generating radiology reports via memory-driven transformer,” arXiv preprint arXiv:2010.16056, 2020.
  12. [12] J. Wang, A. Bhalerao, T. Yin, S. See, and Y. He, “CAMANet: Class activation map guided attention network for radiology report generation,” IEEE Journal of Biomedical and Health Informatics, vol. 28, no. 4, pp. 2199–2210, 2024.
  13. [13] Z. Zhang, Y. Yu, Y. Chen, X. Yang, and S. Y. Yeo, “MedUnifier: Unifying vision-and-language pre-training on medical data with vision generation task using discrete visual representations,” arXiv preprint arXiv:2503.01019, 2025.
  14. [14] Z. Chen, Y. Shen, Y. Song, and X. Wan, “Cross-modal memory networks for radiology report generation,” arXiv preprint arXiv:2204.13258, 2022.
  15. [15] M. Li, B. Lin, Z. Chen, H. Lin, X. Liang, and X. Chang, “Dynamic graph enhanced contrastive learning for chest X-ray report generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3334–3343.
  16. [16] Z. Huang, X. Zhang, and S. Zhang, “KiUT: Knowledge-injected U-transformer for radiology report generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19809–19818.
  17. [17] M. Li, W. Cai, K. Verspoor, S. Pan, X. Liang, and X. Chang, “Cross-modal clinical graph transformer for ophthalmic report generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20656–20665.
  18. [18] H. Zhao, J. Chen, L. Huang, T. Yang, W. Ding, and C. Li, “Automatic generation of medical report with knowledge graph,” in Proceedings of the 2021 10th International Conference on Computing and Pattern Recognition, 2021, pp. 1–1.
  19. [19] Y. Zhang, X. Wang, Z. Xu, Q. Yu, A. Yuille, and D. Xu, “When radiology report generation meets knowledge graph,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12910–12917.
  20. [20] S. Yang, X. Wu, S. Ge, Z. Zheng, S. K. Zhou, and L. Xiao, “Radiology report generation with a learned knowledge base and multi-modal alignment,” Medical Image Analysis, vol. 86, p. 102798, 2023.
  21. [21] S. Yang, X. Wu, S. Ge, S. K. Zhou, and L. Xiao, “Knowledge matters: Chest radiology report generation with general and specific knowledge,” Medical Image Analysis, vol. 80, p. 102510, 2022.
  22. [22] Y. Li, B. Yang, X. Cheng, Z. Zhu, H. Li, and Y. Zou, “Unify, align and refine: Multi-level semantic alignment for radiology report generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2863–2874.
  23. [23] A. Liu, Y. Guo, J.-h. Yong, and F. Xu, “Multi-grained radiology report generation with sentence-level image-language contrastive learning,” IEEE Transactions on Medical Imaging, vol. 43, no. 7, pp. 2657–2669, 2024.
  24. [24] S. Y. Yeo, X. Xie, I. Sazonov, and P. Nithiarasu, “Level set segmentation with robust image gradient energy and statistical shape prior,” in 2011 18th IEEE International Conference on Image Processing. IEEE, 2011, pp. 3397–3400.
  25. [25] X. Yang, Y. Su, R. Duan, H. Fan, S. Y. Yeo, C. Lim, L. Zhong, and R. S. Tan, “Cardiac image segmentation by random walks with dynamic shape constraint,” IET Computer Vision, vol. 10, no. 1, pp. 79–86, 2016.
  26. [26] X. Yang, Z. Zeng, S. Y. Yeo, C. Tan, H. L. Tey, and Y. Su, “A novel multi-task deep learning model for skin lesion segmentation and classification,” arXiv preprint arXiv:1703.01025, 2017.
  27. [27] J. Liu, S. Y. Tan, X. Yang, Y. Xu, and S. Y. Yeo, “Effdnet: A scribble-supervised medical image segmentation method with enhanced foreground feature discrimination,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2025, pp. 194–204.
  28. [28] J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing when to look: Adaptive attention via a visual sentinel for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 375–383.
  29. [29] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.
  30. [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  31. [31] S. Santhanam, “Context based text-generation using LSTM networks,” arXiv preprint arXiv:2005.00048, 2020.
  32. [32] M. Sirshar, M. F. K. Paracha, M. U. Akram, N. S. Alghamdi, S. Z. Y. Zaidi, and T. Fatima, “Attention based automated radiology report generation using CNN and LSTM,” PLOS ONE, vol. 17, no. 1, p. e0262209, 2022.
  33. [33] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7008–7024.
  34. [34] B. Dai, S. Fidler, R. Urtasun, and D. Lin, “Towards diverse and natural image descriptions via a conditional GAN,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2970–2979.
  35. [35] J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12888–12900.
  36. [36] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei et al., “Oscar: Object-semantics aligned pre-training for vision-language tasks,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX. Springer, 2020, pp. 121–137.
  37. [37]
  38. [38] Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao, “SimVLM: Simple visual language model pretraining with weak supervision,” arXiv preprint arXiv:2108.10904, 2021.
  39. [39] X. Hou, Y. Luo, W. Song, Y. Guo, W. You, and S. Li, “Radiographic reports generation via retrieval enhanced cross-modal fusion,” in 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2024, pp. 2032–2039.
  40. [40] Y. Yang, X. You, K. Zhang, Z. Fu, X. Wang, J. Ding, J. Sun, Z. Yu, Q. Huang, W. Han et al., “Spatio-temporal and retrieval-augmented modelling for chest X-ray report generation,” IEEE Transactions on Medical Imaging, 2025.
  41. [41] X. Wang, Y. Peng, L. Lu, Z. Lu, and R. M. Summers, “TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9049–9058.
  42. [42] S. Li, P. Qiao, L. Wang, M. Ning, L. Yuan, Y. Zheng, and J. Chen, “An organ-aware diagnosis framework for radiology report generation,” IEEE Transactions on Medical Imaging, vol. 43, no. 12, pp. 4253–4265, 2024.
  43. [43] Y. Tian, F. Xia, and Y. Song, “Diffusion networks with task-specific noise control for radiology report generation,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 1771–1780.
  44. [44] Y. Jin, W. Chen, Y. Tian, Y. Song, and C. Yan, “Improving radiology report generation with multi-grained abnormality prediction,” Neurocomputing, vol. 600, p. 128122, 2024.
  45. [45] F. Wang, Y. Zhou, S. Wang, V. Vardhanabhuti, and L. Yu, “Multi-granularity cross-modal alignment for generalized medical visual representation learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 33536–33549, 2022.
  46. [46] T. Zhu, Q. Liu, F. Wang, Z. Tu, and M. Chen, “Unraveling cross-modality knowledge conflicts in large vision-language models,” arXiv preprint arXiv:2410.03659, 2024.
  47. [47] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao, “CLIP-Adapter: Better vision-language models with feature adapters,” International Journal of Computer Vision, vol. 132, no. 2, pp. 581–595, 2024.
  48. [48] Y. Xia, H. Huang, J. Zhu, and Z. Zhao, “Achieving cross modal generalization with multimodal unified representation,” Advances in Neural Information Processing Systems, vol. 36, pp. 63529–63541, 2023.
  49. [49] Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” in Proceedings of the Conference. Association for Computational Linguistics. Meeting, vol. 2019, 2019, p. 6558.
  50. [50] D. Yang, S. Huang, H. Kuang, Y. Du, and L. Zhang, “Disentangled representation learning for multimodal emotion recognition,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 1642–1651.
  51. [51] Z. Liu, B. Zhou, D. Chu, Y. Sun, and L. Meng, “Modality translation-based multimodal sentiment analysis under uncertain missing modalities,” Information Fusion, vol. 101, p. 101973, 2024.
  52. [52] J. Tian, K. Wang, X. Xu, Z. Cao, F. Shen, and H. T. Shen, “Multimodal disentanglement variational autoencoders for zero-shot cross-modal retrieval,” in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 960–969.
  53. [53] Y. Zeng, W. Yan, S. Mai, and H. Hu, “Disentanglement translation network for multimodal sentiment analysis,” Information Fusion, vol. 102, p. 102031, 2024.
  54. [54] Y. Li, Y. Wang, and Z. Cui, “Decoupled multimodal distilling for emotion recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6631–6640.
  55. [55] M. Li, D. Yang, Y. Lei, S. Wang, S. Wang, L. Su, K. Yang, Y. Wang, M. Sun, and L. Zhang, “A unified self-distillation framework for multimodal sentiment analysis with uncertain missing modalities,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 9, 2024, pp. 10074–10082.
  56. [56] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio, “Learning with a Wasserstein loss,” Advances in Neural Information Processing Systems, vol. 28, 2015.
  57. [57] L. Yang, J. Liu, S. Hong, Z. Zhang, Z. Huang, Z. Cai, W. Zhang, and B. Cui, “Improving diffusion-based image synthesis with context prediction,” Advances in Neural Information Processing Systems, vol. 36, pp. 37636–37656, 2023.
  58. [58] F. L. Hitchcock, “The distribution of a product from several sources to numerous localities,” Journal of Mathematics and Physics, vol. 20, no. 1-4, pp. 224–230, 1941.
  59. [59] A. R. Lahitani, A. E. Permanasari, and N. A. Setiawan, “Cosine similarity to determine similarity measure: Study case in online essay assessment,” in 2016 4th International Conference on Cyber and IT Service Management. IEEE, 2016, pp. 1–6.
  60. [60] S. Kullback, “Kullback-Leibler divergence,” 1951.
  61. [61] J.-D. Rolle, “Various issues around the L1-norm distance,” arXiv preprint arXiv:2110.04787, 2021.
  62. [62] I. Dokmanic, R. Parhizkar, J. Ranieri, and M. Vetterli, “Euclidean distance matrices: Essential theory, algorithms, and applications,” IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 12–30, 2015.
  63. [63] M. Cuturi, “Lightspeed computation of optimal transportation distances,” Advances in Neural Information Processing Systems, vol. 26, no. 2, pp. 2292–2300, 2013.
  64. [64] H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3D object reconstruction from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 605–613.
  65. [65] R. Jonker and T. Volgenant, “A shortest augmenting path algorithm for dense and sparse linear assignment problems,” in DGOR/NSOR: Papers of the 16th Annual Meeting of DGOR in Cooperation with NSOR / Vorträge der 16. Jahrestagung der DGOR zusammen mit der NSOR. Springer, 1988, pp. 622–622.
  66. [66] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.
  67. [67] C. Villani et al., Optimal Transport: Old and New. Springer, 2008, vol. 338.
  68. [68] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” arXiv preprint arXiv:1803.02155, 2018.
  69. [69] K. Wu, H. Peng, M. Chen, J. Fu, and H. Chao, “Rethinking and improving relative position encoding for vision transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10033–10041.
  70. [70] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald, “Preparing a collection of radiology examinations for distribution and retrieval,” Journal of the American Medical Informatics Association, vol. 23, no. 2, pp. 304–310, 2016.
  71. [71] Y. Li, X. Liang, Z. Hu, and E. P. Xing, “Hybrid retrieval-generation reinforced agent for medical image report generation,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  72. [72] F. Liu, X. Wu, S. Ge, W. Fan, and Y. Zou, “Exploring and distilling posterior and prior knowledge for radiology report generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13753–13762.
  73. [73] A. E. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, and S. Horng, “MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs,” arXiv preprint arXiv:1901.07042, 2019.
  74. [74] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
  75. [75] M. Denkowski and A. Lavie, “Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems,” in Proceedings of the Sixth Workshop on Statistical Machine Translation, 2011, pp. 85–91.
  76. [76] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out, 2004, pp. 74–81.
  77. [77] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
  78. [78] F. Liu, S. Ge, Y. Zou, and X. Wu, “Competence-based multimodal curriculum learning for medical report generation,” arXiv preprint arXiv:2206.14579, 2022.
  79. [79] T. Tanida, P. Müller, G. Kaissis, and D. Rueckert, “Interactive and explainable region-guided radiology report generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7433–7442.
  80. [80] H. Jin, H. Che, Y. Lin, and H. Chen, “PromptMRG: Diagnosis-driven prompts for medical report generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2607–2615.

Showing first 80 references.