Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework

Junhao Li; Yifan Ge; Yucheng Song; Zhifang Liao; Zhining Liao

arxiv: 2511.02271 · v2 · pith:2L2PRTJTnew · submitted 2025-11-04 · 💻 cs.CV

Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework

Yucheng Song , Yifan Ge , Junhao Li , Zhining Liao , Zhifang Liao This is my paper

Pith reviewed 2026-05-18 01:41 UTC · model grok-4.3

classification 💻 cs.CV

keywords Medical Report GenerationCross-Modal LearningCausal InterventionHierarchical Task DecompositionFront-door InterventionRadiology Image AnalysisMultimodal AlignmentSpurious Correlation Reduction

0 comments

The pith

A hierarchical framework splits medical report generation into low-, mid-, and high-level tasks plus front-door causal intervention to fix domain knowledge gaps, entity misalignment, and spurious correlations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical report generation from radiology images faces three persistent problems: models lack sufficient medical domain knowledge, visual and textual entity embeddings fail to align properly, and cross-modal data introduces spurious correlations that distort outputs. This paper claims that decomposing the overall task into a low-level spatial alignment step for entities, a mid-level mutual guidance step using prefix language modeling and masked image modeling, and a high-level causal intervention step via front-door adjustment lets a single model address all three issues at once. If the claim holds, automated reports would become more accurate and interpretable because each level targets a distinct source of error rather than treating them in isolation. The paper shows this structure outperforms prior methods that tackled only one challenge at a time. Readers should care because reliable report generation could meaningfully lighten the workload for radiologists while preserving clinical trust.

Core claim

The HTSC-CIF framework classifies the three core challenges of medical report generation into low-, mid-, and high-level tasks. At the low level, medical entity features are aligned with spatial locations inside the visual encoder to strengthen domain knowledge. At the mid level, Prefix Language Modeling on text and Masked Image Modeling on images provide mutual guidance that improves cross-modal entity embedding alignment. At the high level, a cross-modal causal intervention module applies front-door intervention to block confounders and increase interpretability. Extensive experiments demonstrate that this combined structure significantly outperforms state-of-the-art medical report methods

What carries the argument

The HTSC-CIF framework, which decomposes medical report generation into three task levels and adds a cross-modal causal intervention module that performs front-door intervention to reduce confounders.

If this is right

Low-level spatial alignment of entities supplies visual encoders with explicit medical domain structure that improves feature quality for lesion description.
Mid-level mutual guidance between prefix language modeling and masked image modeling produces tighter cross-modal entity embeddings that reduce misalignment errors.
High-level front-door intervention removes spurious cross-modal correlations, yielding reports that depend more on causal image features than on dataset biases.
Jointly applying all three levels produces higher overall performance on standard medical report generation benchmarks than methods addressing any single challenge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-level decomposition could be tested on related multimodal medical tasks such as automated diagnosis prediction from combined image and text data.
Front-door intervention at the high level might be replaced or augmented with other causal identification strategies if front-door assumptions prove too restrictive in new datasets.
The hierarchical structure suggests a general pattern for multimodal generation problems where domain knowledge, alignment, and bias issues coexist.
If the approach generalizes, it could reduce the need for heavy post-hoc explanation techniques by building interpretability directly into the causal module.

Load-bearing premise

The specific combination of low-level spatial alignment, mid-level mutual guidance via prefix and masked modeling, and high-level front-door intervention will together resolve the three stated challenges without introducing new confounders or harming generalization.

What would settle it

Train the model on a dataset where spurious image-report correlations are deliberately strengthened while keeping entity alignment and domain knowledge constant, then measure whether report accuracy and causal robustness drop below the level of non-intervened baselines.

Figures

Figures reproduced from arXiv: 2511.02271 by Junhao Li, Yifan Ge, Yucheng Song, Zhifang Liao, Zhining Liao.

**Figure 1.** Figure 1: Multi-level task design of HTSC-CIF. challenge necessitates the exploration of automated Medical Report Generation (MRG) systems. However, current MRG also faces several challenges: 1) How to incorporate rich domain knowledge into the model to improve the accuracy and reliability of the report. Medical images contain a significant amount of specialized information, which often requires a deep medical ba… view at source ↗

**Figure 2.** Figure 2: The overall structure of HTSC-CIF. (a) Domain Knowledge Enhancement Module. (b) Cross-Modal Alignment Module. (c) [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Description of causal structural modeling. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative examples of HTSC-CIF on MIMIC-CXR. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists' burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work only addresses single challenges, while this paper tackles all three via a novel hierarchical task decomposition approach, proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: a cross-modal causal intervention module (via front-door intervention) to reduce confounders and improve interpretability. Extensive experiments confirm HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper layers low/mid/high tasks with front-door causal intervention to tackle three MRG challenges together, but the intervention assumptions look under-checked for radiology data.

read the letter

The main takeaway is that this work decomposes medical report generation into low-level spatial alignment for domain knowledge, mid-level mutual guidance via prefix and masked modeling for better entity alignment, and high-level front-door intervention to cut spurious cross-modal correlations. That unified structure is the clearest novelty compared with earlier papers that handled one issue at a time. The authors report consistent gains over SOTA baselines on standard datasets and plan to release code, which helps with checking the results later. The layered design gives a practical way to organize what are usually competing objectives in vision-language medical models. The experiments appear to include ablations that isolate each level, which is useful even if the absolute numbers are not dramatic. The soft spot sits in the high-level module. Front-door intervention needs a clean mediator and no unmeasured confounders between the aligned embeddings and the generated tokens. Medical images and reports share many latent clinical factors, so it is not obvious the intervention actually removes bias rather than adding capacity. Without sensitivity checks or explicit do-calculus verification in the radiology setting, the causal claim rests more on modeling choice than on demonstrated robustness. The rest of the modeling and citation choices look standard for the area and do not show obvious circularity. This paper is for researchers working on multimodal report generation or causal methods in medical imaging. Someone already building or evaluating these systems would get concrete ideas from the task breakdown and the intervention module. It deserves a serious referee because the problem is real, the experiments are reported, and the code release makes follow-up feasible. I would send it to review and ask the authors to add checks on the no-unmeasured-confounding assumption or show that performance holds under plausible violations.

Referee Report

2 major / 2 minor

Summary. The paper proposes HTSC-CIF, a hierarchical framework for medical report generation (MRG) that decomposes three challenges—insufficient domain knowledge, poor text-visual entity alignment, and spurious cross-modal correlations—into low-, mid-, and high-level tasks. Low-level aligns entity features with spatial locations in visual encoders; mid-level employs Prefix Language Modeling and Masked Image Modeling for mutual cross-modal guidance; high-level introduces a cross-modal causal intervention module based on front-door intervention to block confounders and improve interpretability. The authors claim that extensive experiments demonstrate significant outperformance over state-of-the-art MRG methods.

Significance. If the central claims hold, the hierarchical decomposition combined with explicit causal intervention offers a principled way to jointly address domain knowledge, alignment, and bias issues that prior single-challenge methods leave unresolved. The use of front-door intervention for interpretability in cross-modal generation is a distinctive technical contribution that could influence future work on reliable, bias-reduced medical report models.

major comments (2)

[§3.3] §3.3 (High-level Cross-Modal Causal Intervention): The front-door identification formula is invoked with mid-level aligned entity embeddings as mediator M, yet the manuscript supplies neither a do-calculus derivation confirming P(Y|do(X)) = ∑_m P(M=m|X) ∑_x P(Y|M=m,X=x) P(X=x) nor any sensitivity analysis for the required no-unmeasured-confounding assumption between M and generated report tokens. In radiology data, where visual features and textual entities share latent clinical factors, this assumption is load-bearing for the claim that spurious correlations are reduced; without verification the performance gains could be attributable to added capacity rather than causal blocking.
[§4] §4 (Experiments) and Table 2: While the abstract asserts outperformance of SOTA methods, the reported metrics, ablation studies isolating the causal module, and error analysis on entity alignment failures are not cross-referenced to the specific low/mid/high-level contributions. This makes it impossible to assess whether the hierarchical structure itself, rather than any single added component, drives the gains.

minor comments (2)

[§3] Notation for the mediator variable M and the intervention operator is introduced only in the high-level subsection; an explicit equation block early in §3 would improve readability.
[§3.2] The description of Prefix Language Modeling and Masked Image Modeling in the mid-level module would benefit from a short pseudocode listing or diagram showing the mutual-guidance flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our paper. We address each of the major comments below and outline the revisions we plan to make to strengthen the manuscript.

read point-by-point responses

Referee: [§3.3] §3.3 (High-level Cross-Modal Causal Intervention): The front-door identification formula is invoked with mid-level aligned entity embeddings as mediator M, yet the manuscript supplies neither a do-calculus derivation confirming P(Y|do(X)) = ∑_m P(M=m|X) ∑_x P(Y|M=m,X=x) P(X=x) nor any sensitivity analysis for the required no-unmeasured-confounding assumption between M and generated report tokens. In radiology data, where visual features and textual entities share latent clinical factors, this assumption is load-bearing for the claim that spurious correlations are reduced; without verification the performance gains could be attributable to added capacity rather than causal blocking.

Authors: We thank the referee for highlighting this point on the causal intervention. The front-door criterion is applied with mid-level aligned entity embeddings as mediator M to block spurious cross-modal correlations, following standard causal inference practice for front-door adjustment. We acknowledge that an explicit do-calculus derivation was not provided in the original manuscript. In the revision we will insert a step-by-step derivation confirming the identification formula P(Y|do(X)) = ∑_m P(M=m|X) ∑_x P(Y|M=m,X=x) P(X=x) under our hierarchical setting. Regarding the no-unmeasured-confounding assumption between M and Y, the low- and mid-level modules are explicitly designed to reduce shared latent clinical factors through entity alignment and mutual guidance; our existing ablations already separate capacity from the intervention effect. Nevertheless, we will add a dedicated discussion of this assumption together with a sensitivity analysis (e.g., via simulation bounds or proxy confounding metrics) to further substantiate that gains arise from causal blocking rather than parameter count alone. revision: yes
Referee: [§4] §4 (Experiments) and Table 2: While the abstract asserts outperformance of SOTA methods, the reported metrics, ablation studies isolating the causal module, and error analysis on entity alignment failures are not cross-referenced to the specific low/mid/high-level contributions. This makes it impossible to assess whether the hierarchical structure itself, rather than any single added component, drives the gains.

Authors: We agree that stronger explicit linkages between results and the three task levels would improve clarity. In the revised manuscript we will reorganize Section 4 to cross-reference every reported metric and ablation directly to the low-level (spatial entity alignment), mid-level (Prefix LM + Masked IM), and high-level (causal intervention) modules. We will expand the ablation tables to isolate the causal module’s incremental contribution and add a focused error analysis that attributes entity alignment failures to the mid-level task. These changes will make it possible to evaluate whether the hierarchical decomposition, rather than any isolated component, accounts for the observed improvements over SOTA methods. revision: yes

Circularity Check

0 steps flagged

No circularity: novel hierarchical modules and causal intervention introduced without self-referential reductions

full rationale

The paper presents HTSC-CIF as a new framework that decomposes MRG challenges into low-level spatial alignment, mid-level mutual guidance via prefix/masked modeling, and high-level front-door intervention. No equations, fitted parameters renamed as predictions, or self-citation chains are exhibited that would make any claimed result equivalent to its inputs by construction. Performance gains are attributed to experimental outperformance of SOTA methods rather than tautological definitions or unverified uniqueness theorems from the same authors. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, mathematical axioms, or newly postulated entities; the framework description mentions modules but does not detail any fitted constants, background assumptions, or invented constructs.

pith-pipeline@v0.9.0 · 5739 in / 1195 out tokens · 59612 ms · 2026-05-18T01:41:31.446328+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

P(Y|do(X)) = sum_m P(M=m|X) sum_x P(Y|M=m,X=x) P(X=x)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages

[1]

Making the most of text semantics to improve biomedical vision–language processing

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision–language processing. InEuro- pean conference on computer vision, pages 1–21. Springer,

work page
[2]

An causal xai diagnostic model for breast cancer based on mammography reports

Dehua Chen, Hongjin Zhao, Jianrong He, Qiao Pan, and Weiliang Zhao. An causal xai diagnostic model for breast cancer based on mammography reports. In2021 IEEE in- ternational conference on bioinformatics and biomedicine (BIBM), pages 3341–3349. IEEE, 2021. 2

work page 2021
[3]

Cross-modal causal intervention for medical report generation.arXiv preprint arXiv:2303.09117, 2023

Weixing Chen, Yang Liu, Ce Wang, Jiarui Zhu, Shen Zhao, Guanbin Li, Cheng-Lin Liu, and Liang Lin. Cross-modal causal intervention for medical report generation.arXiv preprint arXiv:2303.09117, 2023. 2, 3, 4

work page arXiv 2023
[4]

Cross-modal causal represen- tation learning for radiology report generation.IEEE Trans- actions on Image Processing, 34:2970–2985, 2025

Weixing Chen, Yang Liu, Ce Wang, Jiarui Zhu, Guanbin Li, Cheng-Lin Liu, and Liang Lin. Cross-modal causal represen- tation learning for radiology report generation.IEEE Trans- actions on Image Processing, 34:2970–2985, 2025. 7, 8

work page 2025
[5]

Generating radiology reports via memory-driven trans- former.arXiv preprint arXiv:2010.16056, 2020

Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory-driven trans- former.arXiv preprint arXiv:2010.16056, 2020. 8

work page arXiv 2010
[6]

Cross-modal memory networks for radiology report gener- ation

Zhihong Chen, Yaling Shen, Yan Song, and Xiang Wan. Cross-modal memory networks for radiology report gener- ation. InProceedings of the 59th Annual Meeting of the As- sociation for Computational Linguistics and the 11th Inter- national Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5904–5914, Online, 2021. Association for C...

work page 2021
[7]

Cross-modal memory networks for radiology report gener- ation.arXiv preprint arXiv:2204.13258, 2022

Zhihong Chen, Yaling Shen, Yan Song, and Xiang Wan. Cross-modal memory networks for radiology report gener- ation.arXiv preprint arXiv:2204.13258, 2022. 3

work page arXiv 2022
[8]

Prior: Prototype representation joint learning from medical images and reports

Pujin Cheng, Li Lin, Junyan Lyu, Yijin Huang, Wenhan Luo, and Xiaoying Tang. Prior: Prototype representation joint learning from medical images and reports. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 21361–21371, 2023. 3

work page 2023
[9]

Difficulties in the interpretation of chest radiography.Comparative inter- pretation of CT and standard radiography of the chest, pages 27–49, 2011

Louke Delrue, Robert Gosselin, Bart Ilsen, An Van Lan- deghem, Johan de Mey, and Philippe Duyck. Difficulties in the interpretation of chest radiography.Comparative inter- pretation of CT and standard radiography of the chest, pages 27–49, 2011. 1

work page 2011
[10]

Preparing a collection of radiology examinations for distribution and re- trieval.Journal of the American Medical Informatics Asso- ciation, 23(2):304–310, 2016

Dina Demner-Fushman, Marc D Kohli, Marc B Rosen- man, Sonya E Shooshan, Laritza Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. Preparing a collection of radiology examinations for distribution and re- trieval.Journal of the American Medical Informatics Asso- ciation, 23(2):304–310, 2016. 6

work page 2016
[11]

Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review,

Iryna Hartsock and Ghulam Rasool. Vision-language models for medical report generation and visual question answering: A review.arXiv preprint arXiv:2403.02469, 2024. 2

work page arXiv 2024
[12]

Transfg: A trans- former architecture for fine-grained recognition

Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, and Changhu Wang. Transfg: A trans- former architecture for fine-grained recognition. InProceed- ings of the AAAI conference on artificial intelligence, pages 852–860, 2022. 6

work page 2022
[13]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000– 16009, 2022. 4

work page 2022
[14]

Kiut: Knowledge-injected u-transformer for radiology report generation

Zhongzhen Huang, Xiaofan Zhang, and Shaoting Zhang. Kiut: Knowledge-injected u-transformer for radiology report generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19809– 19818, 2023. 2, 8

work page 2023
[15]

arXiv preprint arXiv:2106.14463 , year=

Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven QH Truong, Du Nguyen Duong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P Lungren, Andrew Y Ng, et al. Rad- graph: Extracting clinical entities and relations from radiol- ogy reports.arXiv preprint arXiv:2106.14463, 2021. 6

work page arXiv 2021
[16]

Promptmrg: Diagnosis-driven prompts for medical report generation

Haibo Jin, Haoxuan Che, Yi Lin, and Hao Chen. Promptmrg: Diagnosis-driven prompts for medical report generation. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 2607–2615, 2024. 8

work page 2024
[17]

Mimic-cxr, a de- identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de- identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019. 6

work page 2019
[18]

A causal perspective on dataset bias in machine learning for medical imaging.Nature Machine Intelligence, 6(2):138–146, 2024

Charles Jones, Daniel C Castro, Fabio De Sousa Ribeiro, Ozan Oktay, Melissa McCradden, and Ben Glocker. A causal perspective on dataset bias in machine learning for medical imaging.Nature Machine Intelligence, 6(2):138–146, 2024. 3

work page 2024
[19]

Dynamic graph enhanced contrastive learning for chest x-ray report generation

Mingjie Li, Bingqian Lin, Zicong Chen, Haokun Lin, Xi- aodan Liang, and Xiaojun Chang. Dynamic graph enhanced contrastive learning for chest x-ray report generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3334–3343, 2023. 2, 8

work page 2023
[20]

Unify, align and refine: Multi- level semantic alignment for radiology report generation

Yaowei Li, Bang Yang, Xuxin Cheng, Zhihong Zhu, Hongx- iang Li, and Yuexian Zou. Unify, align and refine: Multi- level semantic alignment for radiology report generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2863–2874, 2023. 3

work page 2023
[21]

Exploring and distilling posterior and prior knowl- edge for radiology report generation

Fenglin Liu, Xian Wu, Shen Ge, Wei Fan, and Yuexian Zou. Exploring and distilling posterior and prior knowl- edge for radiology report generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13753–13762, 2021. 2

work page 2021
[22]

Contrastive attention for automatic chest x-ray report generation.arXiv preprint arXiv:2106.06965, 2021

Fenglin Liu, Changchang Yin, Xian Wu, Shen Ge, Yuex- ian Zou, Ping Zhang, and Xu Sun. Contrastive attention for automatic chest x-ray report generation.arXiv preprint arXiv:2106.06965, 2021. 2

work page arXiv 2021
[23]

Auto-encoding knowledge graph for unsupervised medical report generation.Advances in Neural Information Process- ing Systems, 34:16266–16279, 2021

Fenglin Liu, Chenyu You, Xian Wu, Shen Ge, Xu Sun, et al. Auto-encoding knowledge graph for unsupervised medical report generation.Advances in Neural Information Process- ing Systems, 34:16266–16279, 2021. 3

work page 2021
[24]

In-context learning for zero-shot medical re- port generation

Rui Liu, Mingjie Li, Shen Zhao, Ling Chen, Xiaojun Chang, and Lina Yao. In-context learning for zero-shot medical re- port generation. InProceedings of the 32nd ACM Interna- tional Conference on Multimedia, pages 8721–8730, 2024. 8

work page 2024
[25]

Reinforced cross-modal alignment for radiology report generation

Han Qin and Yan Song. Reinforced cross-modal alignment for radiology report generation. InFindings of the Associa- tion for Computational Linguistics: ACL 2022, pages 448– 458, 2022. 2

work page 2022
[26]

Automatic radiology reports generation via memory align- ment network

Hongyu Shen, Mingtao Pei, Juncai Liu, and Zhaoxing Tian. Automatic radiology reports generation via memory align- ment network. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4776–4783, 2024. 8

work page 2024
[27]

Interactive and explainable region-guided radiol- ogy report generation

Tim Tanida, Philip M ¨uller, Georgios Kaissis, and Daniel Rueckert. Interactive and explainable region-guided radiol- ogy report generation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7433–7442, 2023. 8

work page 2023
[28]

Memory- based cross-modal semantic alignment network for radiology report generation.IEEE Journal of Biomedical and Health Informatics, 2024

Yitian Tao, Liyan Ma, Jing Yu, and Han Zhang. Memory- based cross-modal semantic alignment network for radiology report generation.IEEE Journal of Biomedical and Health Informatics, 2024. 3

work page 2024
[29]

Xraygpt: Chest radiographs summarization using medical vision-language models

Omkar Thawkar, Abdelrahman Shaker, Sahal Shaji Mullap- pilly, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, and Fahad Shahbaz Khan. Xraygpt: Chest radiographs summarization using medical vision- language models.arXiv preprint arXiv:2306.07971, 2023. 8

work page arXiv 2023
[30]

Towards gen- eralist biomedical ai.Nejm Ai, 1(3):AIoa2300138, 2024

Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaeker- mann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards gen- eralist biomedical ai.Nejm Ai, 1(3):AIoa2300138, 2024. 8

work page 2024
[31]

doi:10.48550/arxiv.2206.05498 , arxivId =

Athanasios Vlontzos, Daniel Rueckert, and Bernhard Kainz. A review of causality for learning algorithms in medical im- age analysis.arXiv preprint arXiv:2206.05498, 2022. 3

work page arXiv 2022
[32]

Entity, relation, and event extraction with contextualized span representations.arXiv preprint arXiv:1909.03546, 2019

David Wadden, Ulme Wennberg, Yi Luan, and Han- naneh Hajishirzi. Entity, relation, and event extraction with contextualized span representations.arXiv preprint arXiv:1909.03546, 2019. 7

work page arXiv 1909
[33]

Cross-modal pro- totype driven network for radiology report generation

Jun Wang, Abhir Bhalerao, and Yulan He. Cross-modal pro- totype driven network for radiology report generation. In European Conference on Computer Vision, pages 563–579. Springer, 2022. 3

work page 2022
[34]

Causal attention for unbiased visual recognition

Tan Wang, Chang Zhou, Qianru Sun, and Hanwang Zhang. Causal attention for unbiased visual recognition. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 3091–3100, 2021. 6

work page 2021
[35]

Tienet: Text-image embedding net- work for common thorax disease classification and reporting in chest x-rays

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, and Ronald M Summers. Tienet: Text-image embedding net- work for common thorax disease classification and reporting in chest x-rays. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 9049–9058,

work page
[36]

arXiv preprint arXiv:2108.10904 , year=

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision.arXiv preprint arXiv:2108.10904, 2021. 4

work page arXiv 2021
[37]

Metransformer: Radiology report generation by transformer with multiple learnable expert tokens

Zhanyu Wang, Lingqiao Liu, Lei Wang, and Luping Zhou. Metransformer: Radiology report generation by transformer with multiple learnable expert tokens. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11558–11567, 2023. 8

work page 2023
[38]

R2gengpt: Radiology report generation with frozen llms

Zhanyu Wang, Lingqiao Liu, Lei Wang, and Luping Zhou. R2gengpt: Radiology report generation with frozen llms. Meta-Radiology, 1(3):100033, 2023. 8

work page 2023
[39]

Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis

Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 21372–21383, 2023. 2

work page 2023
[40]

Causal infer- ence in the medical domain: a survey.Applied Intelligence, pages 1–24, 2024

Xing Wu, Shaoqi Peng, Jingwen Li, Jian Zhang, Qun Sun, Weimin Li, Quan Qian, Yue Liu, and Yike Guo. Causal infer- ence in the medical domain: a survey.Applied Intelligence, pages 1–24, 2024. 3

work page 2024
[41]

A survey on incorporating do- main knowledge into deep learning for medical image anal- ysis.Medical Image Analysis, 69:101985, 2021

Xiaozheng Xie, Jianwei Niu, Xuefeng Liu, Zhengsu Chen, Shaojie Tang, and Shui Yu. A survey on incorporating do- main knowledge into deep learning for medical image anal- ysis.Medical Image Analysis, 69:101985, 2021. 3

work page 2021
[42]

Vision-knowledge fusion model for multi-domain medical report generation.Infor- mation Fusion, 97:101817, 2023

Dexuan Xu, Huashi Zhu, Yu Huang, Zhi Jin, Weiping Ding, Hang Li, and Menglong Ran. Vision-knowledge fusion model for multi-domain medical report generation.Infor- mation Fusion, 97:101817, 2023. 3

work page 2023
[43]

Show, attend and tell: Neural image caption gen- eration with visual attention

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption gen- eration with visual attention. InInternational conference on machine learning, pages 2048–2057. PMLR, 2015. 5

work page 2048
[44]

Attributed abnor- mality graph embedding for clinically accurate x-ray report generation.IEEE Transactions on Medical Imaging, 42(8): 2211–2222, 2023

Sixing Yan, William K Cheung, Keith Chiu, Terence M Tong, Ka Chun Cheung, and Simon See. Attributed abnor- mality graph embedding for clinically accurate x-ray report generation.IEEE Transactions on Medical Imaging, 42(8): 2211–2222, 2023. 3

work page 2023
[45]

Knowledge matters: Chest radiology report genera- tion with general and specific knowledge.Medical image analysis, 80:102510, 2022

Shuxin Yang, Xian Wu, Shen Ge, S Kevin Zhou, and Li Xiao. Knowledge matters: Chest radiology report genera- tion with general and specific knowledge.Medical image analysis, 80:102510, 2022. 3

work page 2022
[46]

Radiology report generation with a learned knowledge base and multi-modal alignment.Med- ical Image Analysis, 86:102798, 2023

Shuxin Yang, Xian Wu, Shen Ge, Zhuozhao Zheng, S Kevin Zhou, and Li Xiao. Radiology report generation with a learned knowledge base and multi-modal alignment.Med- ical Image Analysis, 86:102798, 2023. 3, 8

work page 2023
[47]

Deconfounded image captioning: A causal retrospect.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 45(11): 12996–13010, 2021

Xu Yang, Hanwang Zhang, and Jianfei Cai. Deconfounded image captioning: A causal retrospect.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 45(11): 12996–13010, 2021. 4

work page 2021
[48]

Aligntransformer: Hierarchical alignment of visual regions and disease tags for medical report genera- tion

Di You, Fenglin Liu, Shen Ge, Xiaoxia Xie, Jing Zhang, and Xian Wu. Aligntransformer: Hierarchical alignment of visual regions and disease tags for medical report genera- tion. InMedical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Pro- ceedings, Part III 2...

work page 2021
[49]

Anatomy-guided weakly- supervised abnormality localization in chest x-rays

Ke Yu, Shantanu Ghosh, Zhexiong Liu, Christopher Deible, and Kayhan Batmanghelich. Anatomy-guided weakly- supervised abnormality localization in chest x-rays. InIn- ternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 658–668. Springer,

work page
[50]

Au- tomatic radiology report generation based on multi-view image fusion and medical concept enrichment

Jianbo Yuan, Haofu Liao, Rui Luo, and Jiebo Luo. Au- tomatic radiology report generation based on multi-view image fusion and medical concept enrichment. InMedi- cal Image Computing and Computer Assisted Intervention– MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part VI 22, pages 721–729. Springer, 2019. 3

work page 2019
[51]

When radiology report generation meets knowledge graph

Yixiao Zhang, Xiaosong Wang, Ziyue Xu, Qihang Yu, Alan Yuille, and Daguang Xu. When radiology report generation meets knowledge graph. InProceedings of the AAAI con- ference on artificial intelligence, pages 12910–12917, 2020. 3

work page 2020

[1] [1]

Making the most of text semantics to improve biomedical vision–language processing

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision–language processing. InEuro- pean conference on computer vision, pages 1–21. Springer,

work page

[2] [2]

An causal xai diagnostic model for breast cancer based on mammography reports

Dehua Chen, Hongjin Zhao, Jianrong He, Qiao Pan, and Weiliang Zhao. An causal xai diagnostic model for breast cancer based on mammography reports. In2021 IEEE in- ternational conference on bioinformatics and biomedicine (BIBM), pages 3341–3349. IEEE, 2021. 2

work page 2021

[3] [3]

Cross-modal causal intervention for medical report generation.arXiv preprint arXiv:2303.09117, 2023

Weixing Chen, Yang Liu, Ce Wang, Jiarui Zhu, Shen Zhao, Guanbin Li, Cheng-Lin Liu, and Liang Lin. Cross-modal causal intervention for medical report generation.arXiv preprint arXiv:2303.09117, 2023. 2, 3, 4

work page arXiv 2023

[4] [4]

Cross-modal causal represen- tation learning for radiology report generation.IEEE Trans- actions on Image Processing, 34:2970–2985, 2025

Weixing Chen, Yang Liu, Ce Wang, Jiarui Zhu, Guanbin Li, Cheng-Lin Liu, and Liang Lin. Cross-modal causal represen- tation learning for radiology report generation.IEEE Trans- actions on Image Processing, 34:2970–2985, 2025. 7, 8

work page 2025

[5] [5]

Generating radiology reports via memory-driven trans- former.arXiv preprint arXiv:2010.16056, 2020

Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory-driven trans- former.arXiv preprint arXiv:2010.16056, 2020. 8

work page arXiv 2010

[6] [6]

Cross-modal memory networks for radiology report gener- ation

Zhihong Chen, Yaling Shen, Yan Song, and Xiang Wan. Cross-modal memory networks for radiology report gener- ation. InProceedings of the 59th Annual Meeting of the As- sociation for Computational Linguistics and the 11th Inter- national Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5904–5914, Online, 2021. Association for C...

work page 2021

[7] [7]

Cross-modal memory networks for radiology report gener- ation.arXiv preprint arXiv:2204.13258, 2022

Zhihong Chen, Yaling Shen, Yan Song, and Xiang Wan. Cross-modal memory networks for radiology report gener- ation.arXiv preprint arXiv:2204.13258, 2022. 3

work page arXiv 2022

[8] [8]

Prior: Prototype representation joint learning from medical images and reports

Pujin Cheng, Li Lin, Junyan Lyu, Yijin Huang, Wenhan Luo, and Xiaoying Tang. Prior: Prototype representation joint learning from medical images and reports. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 21361–21371, 2023. 3

work page 2023

[9] [9]

Difficulties in the interpretation of chest radiography.Comparative inter- pretation of CT and standard radiography of the chest, pages 27–49, 2011

Louke Delrue, Robert Gosselin, Bart Ilsen, An Van Lan- deghem, Johan de Mey, and Philippe Duyck. Difficulties in the interpretation of chest radiography.Comparative inter- pretation of CT and standard radiography of the chest, pages 27–49, 2011. 1

work page 2011

[10] [10]

Preparing a collection of radiology examinations for distribution and re- trieval.Journal of the American Medical Informatics Asso- ciation, 23(2):304–310, 2016

Dina Demner-Fushman, Marc D Kohli, Marc B Rosen- man, Sonya E Shooshan, Laritza Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. Preparing a collection of radiology examinations for distribution and re- trieval.Journal of the American Medical Informatics Asso- ciation, 23(2):304–310, 2016. 6

work page 2016

[11] [11]

Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review,

Iryna Hartsock and Ghulam Rasool. Vision-language models for medical report generation and visual question answering: A review.arXiv preprint arXiv:2403.02469, 2024. 2

work page arXiv 2024

[12] [12]

Transfg: A trans- former architecture for fine-grained recognition

Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, and Changhu Wang. Transfg: A trans- former architecture for fine-grained recognition. InProceed- ings of the AAAI conference on artificial intelligence, pages 852–860, 2022. 6

work page 2022

[13] [13]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000– 16009, 2022. 4

work page 2022

[14] [14]

Kiut: Knowledge-injected u-transformer for radiology report generation

Zhongzhen Huang, Xiaofan Zhang, and Shaoting Zhang. Kiut: Knowledge-injected u-transformer for radiology report generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19809– 19818, 2023. 2, 8

work page 2023

[15] [15]

arXiv preprint arXiv:2106.14463 , year=

Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven QH Truong, Du Nguyen Duong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P Lungren, Andrew Y Ng, et al. Rad- graph: Extracting clinical entities and relations from radiol- ogy reports.arXiv preprint arXiv:2106.14463, 2021. 6

work page arXiv 2021

[16] [16]

Promptmrg: Diagnosis-driven prompts for medical report generation

Haibo Jin, Haoxuan Che, Yi Lin, and Hao Chen. Promptmrg: Diagnosis-driven prompts for medical report generation. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 2607–2615, 2024. 8

work page 2024

[17] [17]

Mimic-cxr, a de- identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de- identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019. 6

work page 2019

[18] [18]

A causal perspective on dataset bias in machine learning for medical imaging.Nature Machine Intelligence, 6(2):138–146, 2024

Charles Jones, Daniel C Castro, Fabio De Sousa Ribeiro, Ozan Oktay, Melissa McCradden, and Ben Glocker. A causal perspective on dataset bias in machine learning for medical imaging.Nature Machine Intelligence, 6(2):138–146, 2024. 3

work page 2024

[19] [19]

Dynamic graph enhanced contrastive learning for chest x-ray report generation

Mingjie Li, Bingqian Lin, Zicong Chen, Haokun Lin, Xi- aodan Liang, and Xiaojun Chang. Dynamic graph enhanced contrastive learning for chest x-ray report generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3334–3343, 2023. 2, 8

work page 2023

[20] [20]

Unify, align and refine: Multi- level semantic alignment for radiology report generation

Yaowei Li, Bang Yang, Xuxin Cheng, Zhihong Zhu, Hongx- iang Li, and Yuexian Zou. Unify, align and refine: Multi- level semantic alignment for radiology report generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2863–2874, 2023. 3

work page 2023

[21] [21]

Exploring and distilling posterior and prior knowl- edge for radiology report generation

Fenglin Liu, Xian Wu, Shen Ge, Wei Fan, and Yuexian Zou. Exploring and distilling posterior and prior knowl- edge for radiology report generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13753–13762, 2021. 2

work page 2021

[22] [22]

Contrastive attention for automatic chest x-ray report generation.arXiv preprint arXiv:2106.06965, 2021

Fenglin Liu, Changchang Yin, Xian Wu, Shen Ge, Yuex- ian Zou, Ping Zhang, and Xu Sun. Contrastive attention for automatic chest x-ray report generation.arXiv preprint arXiv:2106.06965, 2021. 2

work page arXiv 2021

[23] [23]

Auto-encoding knowledge graph for unsupervised medical report generation.Advances in Neural Information Process- ing Systems, 34:16266–16279, 2021

Fenglin Liu, Chenyu You, Xian Wu, Shen Ge, Xu Sun, et al. Auto-encoding knowledge graph for unsupervised medical report generation.Advances in Neural Information Process- ing Systems, 34:16266–16279, 2021. 3

work page 2021

[24] [24]

In-context learning for zero-shot medical re- port generation

Rui Liu, Mingjie Li, Shen Zhao, Ling Chen, Xiaojun Chang, and Lina Yao. In-context learning for zero-shot medical re- port generation. InProceedings of the 32nd ACM Interna- tional Conference on Multimedia, pages 8721–8730, 2024. 8

work page 2024

[25] [25]

Reinforced cross-modal alignment for radiology report generation

Han Qin and Yan Song. Reinforced cross-modal alignment for radiology report generation. InFindings of the Associa- tion for Computational Linguistics: ACL 2022, pages 448– 458, 2022. 2

work page 2022

[26] [26]

Automatic radiology reports generation via memory align- ment network

Hongyu Shen, Mingtao Pei, Juncai Liu, and Zhaoxing Tian. Automatic radiology reports generation via memory align- ment network. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4776–4783, 2024. 8

work page 2024

[27] [27]

Interactive and explainable region-guided radiol- ogy report generation

Tim Tanida, Philip M ¨uller, Georgios Kaissis, and Daniel Rueckert. Interactive and explainable region-guided radiol- ogy report generation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7433–7442, 2023. 8

work page 2023

[28] [28]

Memory- based cross-modal semantic alignment network for radiology report generation.IEEE Journal of Biomedical and Health Informatics, 2024

Yitian Tao, Liyan Ma, Jing Yu, and Han Zhang. Memory- based cross-modal semantic alignment network for radiology report generation.IEEE Journal of Biomedical and Health Informatics, 2024. 3

work page 2024

[29] [29]

Xraygpt: Chest radiographs summarization using medical vision-language models

Omkar Thawkar, Abdelrahman Shaker, Sahal Shaji Mullap- pilly, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, and Fahad Shahbaz Khan. Xraygpt: Chest radiographs summarization using medical vision- language models.arXiv preprint arXiv:2306.07971, 2023. 8

work page arXiv 2023

[30] [30]

Towards gen- eralist biomedical ai.Nejm Ai, 1(3):AIoa2300138, 2024

Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaeker- mann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards gen- eralist biomedical ai.Nejm Ai, 1(3):AIoa2300138, 2024. 8

work page 2024

[31] [31]

doi:10.48550/arxiv.2206.05498 , arxivId =

Athanasios Vlontzos, Daniel Rueckert, and Bernhard Kainz. A review of causality for learning algorithms in medical im- age analysis.arXiv preprint arXiv:2206.05498, 2022. 3

work page arXiv 2022

[32] [32]

Entity, relation, and event extraction with contextualized span representations.arXiv preprint arXiv:1909.03546, 2019

David Wadden, Ulme Wennberg, Yi Luan, and Han- naneh Hajishirzi. Entity, relation, and event extraction with contextualized span representations.arXiv preprint arXiv:1909.03546, 2019. 7

work page arXiv 1909

[33] [33]

Cross-modal pro- totype driven network for radiology report generation

Jun Wang, Abhir Bhalerao, and Yulan He. Cross-modal pro- totype driven network for radiology report generation. In European Conference on Computer Vision, pages 563–579. Springer, 2022. 3

work page 2022

[34] [34]

Causal attention for unbiased visual recognition

Tan Wang, Chang Zhou, Qianru Sun, and Hanwang Zhang. Causal attention for unbiased visual recognition. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 3091–3100, 2021. 6

work page 2021

[35] [35]

Tienet: Text-image embedding net- work for common thorax disease classification and reporting in chest x-rays

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, and Ronald M Summers. Tienet: Text-image embedding net- work for common thorax disease classification and reporting in chest x-rays. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 9049–9058,

work page

[36] [36]

arXiv preprint arXiv:2108.10904 , year=

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision.arXiv preprint arXiv:2108.10904, 2021. 4

work page arXiv 2021

[37] [37]

Metransformer: Radiology report generation by transformer with multiple learnable expert tokens

Zhanyu Wang, Lingqiao Liu, Lei Wang, and Luping Zhou. Metransformer: Radiology report generation by transformer with multiple learnable expert tokens. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11558–11567, 2023. 8

work page 2023

[38] [38]

R2gengpt: Radiology report generation with frozen llms

Zhanyu Wang, Lingqiao Liu, Lei Wang, and Luping Zhou. R2gengpt: Radiology report generation with frozen llms. Meta-Radiology, 1(3):100033, 2023. 8

work page 2023

[39] [39]

Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis

Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 21372–21383, 2023. 2

work page 2023

[40] [40]

Causal infer- ence in the medical domain: a survey.Applied Intelligence, pages 1–24, 2024

Xing Wu, Shaoqi Peng, Jingwen Li, Jian Zhang, Qun Sun, Weimin Li, Quan Qian, Yue Liu, and Yike Guo. Causal infer- ence in the medical domain: a survey.Applied Intelligence, pages 1–24, 2024. 3

work page 2024

[41] [41]

A survey on incorporating do- main knowledge into deep learning for medical image anal- ysis.Medical Image Analysis, 69:101985, 2021

Xiaozheng Xie, Jianwei Niu, Xuefeng Liu, Zhengsu Chen, Shaojie Tang, and Shui Yu. A survey on incorporating do- main knowledge into deep learning for medical image anal- ysis.Medical Image Analysis, 69:101985, 2021. 3

work page 2021

[42] [42]

Vision-knowledge fusion model for multi-domain medical report generation.Infor- mation Fusion, 97:101817, 2023

Dexuan Xu, Huashi Zhu, Yu Huang, Zhi Jin, Weiping Ding, Hang Li, and Menglong Ran. Vision-knowledge fusion model for multi-domain medical report generation.Infor- mation Fusion, 97:101817, 2023. 3

work page 2023

[43] [43]

Show, attend and tell: Neural image caption gen- eration with visual attention

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption gen- eration with visual attention. InInternational conference on machine learning, pages 2048–2057. PMLR, 2015. 5

work page 2048

[44] [44]

Attributed abnor- mality graph embedding for clinically accurate x-ray report generation.IEEE Transactions on Medical Imaging, 42(8): 2211–2222, 2023

Sixing Yan, William K Cheung, Keith Chiu, Terence M Tong, Ka Chun Cheung, and Simon See. Attributed abnor- mality graph embedding for clinically accurate x-ray report generation.IEEE Transactions on Medical Imaging, 42(8): 2211–2222, 2023. 3

work page 2023

[45] [45]

Knowledge matters: Chest radiology report genera- tion with general and specific knowledge.Medical image analysis, 80:102510, 2022

Shuxin Yang, Xian Wu, Shen Ge, S Kevin Zhou, and Li Xiao. Knowledge matters: Chest radiology report genera- tion with general and specific knowledge.Medical image analysis, 80:102510, 2022. 3

work page 2022

[46] [46]

Radiology report generation with a learned knowledge base and multi-modal alignment.Med- ical Image Analysis, 86:102798, 2023

Shuxin Yang, Xian Wu, Shen Ge, Zhuozhao Zheng, S Kevin Zhou, and Li Xiao. Radiology report generation with a learned knowledge base and multi-modal alignment.Med- ical Image Analysis, 86:102798, 2023. 3, 8

work page 2023

[47] [47]

Deconfounded image captioning: A causal retrospect.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 45(11): 12996–13010, 2021

Xu Yang, Hanwang Zhang, and Jianfei Cai. Deconfounded image captioning: A causal retrospect.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 45(11): 12996–13010, 2021. 4

work page 2021

[48] [48]

Aligntransformer: Hierarchical alignment of visual regions and disease tags for medical report genera- tion

Di You, Fenglin Liu, Shen Ge, Xiaoxia Xie, Jing Zhang, and Xian Wu. Aligntransformer: Hierarchical alignment of visual regions and disease tags for medical report genera- tion. InMedical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Pro- ceedings, Part III 2...

work page 2021

[49] [49]

Anatomy-guided weakly- supervised abnormality localization in chest x-rays

Ke Yu, Shantanu Ghosh, Zhexiong Liu, Christopher Deible, and Kayhan Batmanghelich. Anatomy-guided weakly- supervised abnormality localization in chest x-rays. InIn- ternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 658–668. Springer,

work page

[50] [50]

Au- tomatic radiology report generation based on multi-view image fusion and medical concept enrichment

Jianbo Yuan, Haofu Liao, Rui Luo, and Jiebo Luo. Au- tomatic radiology report generation based on multi-view image fusion and medical concept enrichment. InMedi- cal Image Computing and Computer Assisted Intervention– MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part VI 22, pages 721–729. Springer, 2019. 3

work page 2019

[51] [51]

When radiology report generation meets knowledge graph

Yixiao Zhang, Xiaosong Wang, Ziyue Xu, Qihang Yu, Alan Yuille, and Daguang Xu. When radiology report generation meets knowledge graph. InProceedings of the AAAI con- ference on artificial intelligence, pages 12910–12917, 2020. 3

work page 2020