pith. sign in

arxiv: 2511.02271 · v2 · pith:2L2PRTJTnew · submitted 2025-11-04 · 💻 cs.CV

Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework

Pith reviewed 2026-05-18 01:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords Medical Report GenerationCross-Modal LearningCausal InterventionHierarchical Task DecompositionFront-door InterventionRadiology Image AnalysisMultimodal AlignmentSpurious Correlation Reduction
0
0 comments X

The pith

A hierarchical framework splits medical report generation into low-, mid-, and high-level tasks plus front-door causal intervention to fix domain knowledge gaps, entity misalignment, and spurious correlations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical report generation from radiology images faces three persistent problems: models lack sufficient medical domain knowledge, visual and textual entity embeddings fail to align properly, and cross-modal data introduces spurious correlations that distort outputs. This paper claims that decomposing the overall task into a low-level spatial alignment step for entities, a mid-level mutual guidance step using prefix language modeling and masked image modeling, and a high-level causal intervention step via front-door adjustment lets a single model address all three issues at once. If the claim holds, automated reports would become more accurate and interpretable because each level targets a distinct source of error rather than treating them in isolation. The paper shows this structure outperforms prior methods that tackled only one challenge at a time. Readers should care because reliable report generation could meaningfully lighten the workload for radiologists while preserving clinical trust.

Core claim

The HTSC-CIF framework classifies the three core challenges of medical report generation into low-, mid-, and high-level tasks. At the low level, medical entity features are aligned with spatial locations inside the visual encoder to strengthen domain knowledge. At the mid level, Prefix Language Modeling on text and Masked Image Modeling on images provide mutual guidance that improves cross-modal entity embedding alignment. At the high level, a cross-modal causal intervention module applies front-door intervention to block confounders and increase interpretability. Extensive experiments demonstrate that this combined structure significantly outperforms state-of-the-art medical report methods

What carries the argument

The HTSC-CIF framework, which decomposes medical report generation into three task levels and adds a cross-modal causal intervention module that performs front-door intervention to reduce confounders.

If this is right

  • Low-level spatial alignment of entities supplies visual encoders with explicit medical domain structure that improves feature quality for lesion description.
  • Mid-level mutual guidance between prefix language modeling and masked image modeling produces tighter cross-modal entity embeddings that reduce misalignment errors.
  • High-level front-door intervention removes spurious cross-modal correlations, yielding reports that depend more on causal image features than on dataset biases.
  • Jointly applying all three levels produces higher overall performance on standard medical report generation benchmarks than methods addressing any single challenge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three-level decomposition could be tested on related multimodal medical tasks such as automated diagnosis prediction from combined image and text data.
  • Front-door intervention at the high level might be replaced or augmented with other causal identification strategies if front-door assumptions prove too restrictive in new datasets.
  • The hierarchical structure suggests a general pattern for multimodal generation problems where domain knowledge, alignment, and bias issues coexist.
  • If the approach generalizes, it could reduce the need for heavy post-hoc explanation techniques by building interpretability directly into the causal module.

Load-bearing premise

The specific combination of low-level spatial alignment, mid-level mutual guidance via prefix and masked modeling, and high-level front-door intervention will together resolve the three stated challenges without introducing new confounders or harming generalization.

What would settle it

Train the model on a dataset where spurious image-report correlations are deliberately strengthened while keeping entity alignment and domain knowledge constant, then measure whether report accuracy and causal robustness drop below the level of non-intervened baselines.

Figures

Figures reproduced from arXiv: 2511.02271 by Junhao Li, Yifan Ge, Yucheng Song, Zhifang Liao, Zhining Liao.

Figure 1
Figure 1. Figure 1: Multi-level task design of HTSC-CIF. challenge necessitates the exploration of automated Medi￾cal Report Generation (MRG) systems. However, current MRG also faces several challenges: 1) How to incorporate rich domain knowledge into the model to improve the ac￾curacy and reliability of the report. Medical images con￾tain a significant amount of specialized information, which often requires a deep medical ba… view at source ↗
Figure 2
Figure 2. Figure 2: The overall structure of HTSC-CIF. (a) Domain Knowledge Enhancement Module. (b) Cross-Modal Alignment Module. (c) [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Description of causal structural modeling. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative examples of HTSC-CIF on MIMIC-CXR. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists' burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work only addresses single challenges, while this paper tackles all three via a novel hierarchical task decomposition approach, proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: a cross-modal causal intervention module (via front-door intervention) to reduce confounders and improve interpretability. Extensive experiments confirm HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes HTSC-CIF, a hierarchical framework for medical report generation (MRG) that decomposes three challenges—insufficient domain knowledge, poor text-visual entity alignment, and spurious cross-modal correlations—into low-, mid-, and high-level tasks. Low-level aligns entity features with spatial locations in visual encoders; mid-level employs Prefix Language Modeling and Masked Image Modeling for mutual cross-modal guidance; high-level introduces a cross-modal causal intervention module based on front-door intervention to block confounders and improve interpretability. The authors claim that extensive experiments demonstrate significant outperformance over state-of-the-art MRG methods.

Significance. If the central claims hold, the hierarchical decomposition combined with explicit causal intervention offers a principled way to jointly address domain knowledge, alignment, and bias issues that prior single-challenge methods leave unresolved. The use of front-door intervention for interpretability in cross-modal generation is a distinctive technical contribution that could influence future work on reliable, bias-reduced medical report models.

major comments (2)
  1. [§3.3] §3.3 (High-level Cross-Modal Causal Intervention): The front-door identification formula is invoked with mid-level aligned entity embeddings as mediator M, yet the manuscript supplies neither a do-calculus derivation confirming P(Y|do(X)) = ∑_m P(M=m|X) ∑_x P(Y|M=m,X=x) P(X=x) nor any sensitivity analysis for the required no-unmeasured-confounding assumption between M and generated report tokens. In radiology data, where visual features and textual entities share latent clinical factors, this assumption is load-bearing for the claim that spurious correlations are reduced; without verification the performance gains could be attributable to added capacity rather than causal blocking.
  2. [§4] §4 (Experiments) and Table 2: While the abstract asserts outperformance of SOTA methods, the reported metrics, ablation studies isolating the causal module, and error analysis on entity alignment failures are not cross-referenced to the specific low/mid/high-level contributions. This makes it impossible to assess whether the hierarchical structure itself, rather than any single added component, drives the gains.
minor comments (2)
  1. [§3] Notation for the mediator variable M and the intervention operator is introduced only in the high-level subsection; an explicit equation block early in §3 would improve readability.
  2. [§3.2] The description of Prefix Language Modeling and Masked Image Modeling in the mid-level module would benefit from a short pseudocode listing or diagram showing the mutual-guidance flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our paper. We address each of the major comments below and outline the revisions we plan to make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3.3] §3.3 (High-level Cross-Modal Causal Intervention): The front-door identification formula is invoked with mid-level aligned entity embeddings as mediator M, yet the manuscript supplies neither a do-calculus derivation confirming P(Y|do(X)) = ∑_m P(M=m|X) ∑_x P(Y|M=m,X=x) P(X=x) nor any sensitivity analysis for the required no-unmeasured-confounding assumption between M and generated report tokens. In radiology data, where visual features and textual entities share latent clinical factors, this assumption is load-bearing for the claim that spurious correlations are reduced; without verification the performance gains could be attributable to added capacity rather than causal blocking.

    Authors: We thank the referee for highlighting this point on the causal intervention. The front-door criterion is applied with mid-level aligned entity embeddings as mediator M to block spurious cross-modal correlations, following standard causal inference practice for front-door adjustment. We acknowledge that an explicit do-calculus derivation was not provided in the original manuscript. In the revision we will insert a step-by-step derivation confirming the identification formula P(Y|do(X)) = ∑_m P(M=m|X) ∑_x P(Y|M=m,X=x) P(X=x) under our hierarchical setting. Regarding the no-unmeasured-confounding assumption between M and Y, the low- and mid-level modules are explicitly designed to reduce shared latent clinical factors through entity alignment and mutual guidance; our existing ablations already separate capacity from the intervention effect. Nevertheless, we will add a dedicated discussion of this assumption together with a sensitivity analysis (e.g., via simulation bounds or proxy confounding metrics) to further substantiate that gains arise from causal blocking rather than parameter count alone. revision: yes

  2. Referee: [§4] §4 (Experiments) and Table 2: While the abstract asserts outperformance of SOTA methods, the reported metrics, ablation studies isolating the causal module, and error analysis on entity alignment failures are not cross-referenced to the specific low/mid/high-level contributions. This makes it impossible to assess whether the hierarchical structure itself, rather than any single added component, drives the gains.

    Authors: We agree that stronger explicit linkages between results and the three task levels would improve clarity. In the revised manuscript we will reorganize Section 4 to cross-reference every reported metric and ablation directly to the low-level (spatial entity alignment), mid-level (Prefix LM + Masked IM), and high-level (causal intervention) modules. We will expand the ablation tables to isolate the causal module’s incremental contribution and add a focused error analysis that attributes entity alignment failures to the mid-level task. These changes will make it possible to evaluate whether the hierarchical decomposition, rather than any isolated component, accounts for the observed improvements over SOTA methods. revision: yes

Circularity Check

0 steps flagged

No circularity: novel hierarchical modules and causal intervention introduced without self-referential reductions

full rationale

The paper presents HTSC-CIF as a new framework that decomposes MRG challenges into low-level spatial alignment, mid-level mutual guidance via prefix/masked modeling, and high-level front-door intervention. No equations, fitted parameters renamed as predictions, or self-citation chains are exhibited that would make any claimed result equivalent to its inputs by construction. Performance gains are attributed to experimental outperformance of SOTA methods rather than tautological definitions or unverified uniqueness theorems from the same authors. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, mathematical axioms, or newly postulated entities; the framework description mentions modules but does not detail any fitted constants, background assumptions, or invented constructs.

pith-pipeline@v0.9.0 · 5739 in / 1195 out tokens · 59612 ms · 2026-05-18T01:41:31.446328+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages

  1. [1]

    Making the most of text semantics to improve biomedical vision–language processing

    Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision–language processing. InEuro- pean conference on computer vision, pages 1–21. Springer,

  2. [2]

    An causal xai diagnostic model for breast cancer based on mammography reports

    Dehua Chen, Hongjin Zhao, Jianrong He, Qiao Pan, and Weiliang Zhao. An causal xai diagnostic model for breast cancer based on mammography reports. In2021 IEEE in- ternational conference on bioinformatics and biomedicine (BIBM), pages 3341–3349. IEEE, 2021. 2

  3. [3]

    Cross-modal causal intervention for medical report generation.arXiv preprint arXiv:2303.09117, 2023

    Weixing Chen, Yang Liu, Ce Wang, Jiarui Zhu, Shen Zhao, Guanbin Li, Cheng-Lin Liu, and Liang Lin. Cross-modal causal intervention for medical report generation.arXiv preprint arXiv:2303.09117, 2023. 2, 3, 4

  4. [4]

    Cross-modal causal represen- tation learning for radiology report generation.IEEE Trans- actions on Image Processing, 34:2970–2985, 2025

    Weixing Chen, Yang Liu, Ce Wang, Jiarui Zhu, Guanbin Li, Cheng-Lin Liu, and Liang Lin. Cross-modal causal represen- tation learning for radiology report generation.IEEE Trans- actions on Image Processing, 34:2970–2985, 2025. 7, 8

  5. [5]

    Generating radiology reports via memory-driven trans- former.arXiv preprint arXiv:2010.16056, 2020

    Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory-driven trans- former.arXiv preprint arXiv:2010.16056, 2020. 8

  6. [6]

    Cross-modal memory networks for radiology report gener- ation

    Zhihong Chen, Yaling Shen, Yan Song, and Xiang Wan. Cross-modal memory networks for radiology report gener- ation. InProceedings of the 59th Annual Meeting of the As- sociation for Computational Linguistics and the 11th Inter- national Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5904–5914, Online, 2021. Association for C...

  7. [7]

    Cross-modal memory networks for radiology report gener- ation.arXiv preprint arXiv:2204.13258, 2022

    Zhihong Chen, Yaling Shen, Yan Song, and Xiang Wan. Cross-modal memory networks for radiology report gener- ation.arXiv preprint arXiv:2204.13258, 2022. 3

  8. [8]

    Prior: Prototype representation joint learning from medical images and reports

    Pujin Cheng, Li Lin, Junyan Lyu, Yijin Huang, Wenhan Luo, and Xiaoying Tang. Prior: Prototype representation joint learning from medical images and reports. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 21361–21371, 2023. 3

  9. [9]

    Difficulties in the interpretation of chest radiography.Comparative inter- pretation of CT and standard radiography of the chest, pages 27–49, 2011

    Louke Delrue, Robert Gosselin, Bart Ilsen, An Van Lan- deghem, Johan de Mey, and Philippe Duyck. Difficulties in the interpretation of chest radiography.Comparative inter- pretation of CT and standard radiography of the chest, pages 27–49, 2011. 1

  10. [10]

    Preparing a collection of radiology examinations for distribution and re- trieval.Journal of the American Medical Informatics Asso- ciation, 23(2):304–310, 2016

    Dina Demner-Fushman, Marc D Kohli, Marc B Rosen- man, Sonya E Shooshan, Laritza Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. Preparing a collection of radiology examinations for distribution and re- trieval.Journal of the American Medical Informatics Asso- ciation, 23(2):304–310, 2016. 6

  11. [11]

    Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review,

    Iryna Hartsock and Ghulam Rasool. Vision-language models for medical report generation and visual question answering: A review.arXiv preprint arXiv:2403.02469, 2024. 2

  12. [12]

    Transfg: A trans- former architecture for fine-grained recognition

    Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, and Changhu Wang. Transfg: A trans- former architecture for fine-grained recognition. InProceed- ings of the AAAI conference on artificial intelligence, pages 852–860, 2022. 6

  13. [13]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000– 16009, 2022. 4

  14. [14]

    Kiut: Knowledge-injected u-transformer for radiology report generation

    Zhongzhen Huang, Xiaofan Zhang, and Shaoting Zhang. Kiut: Knowledge-injected u-transformer for radiology report generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19809– 19818, 2023. 2, 8

  15. [15]

    arXiv preprint arXiv:2106.14463 , year=

    Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven QH Truong, Du Nguyen Duong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P Lungren, Andrew Y Ng, et al. Rad- graph: Extracting clinical entities and relations from radiol- ogy reports.arXiv preprint arXiv:2106.14463, 2021. 6

  16. [16]

    Promptmrg: Diagnosis-driven prompts for medical report generation

    Haibo Jin, Haoxuan Che, Yi Lin, and Hao Chen. Promptmrg: Diagnosis-driven prompts for medical report generation. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 2607–2615, 2024. 8

  17. [17]

    Mimic-cxr, a de- identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

    Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de- identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019. 6

  18. [18]

    A causal perspective on dataset bias in machine learning for medical imaging.Nature Machine Intelligence, 6(2):138–146, 2024

    Charles Jones, Daniel C Castro, Fabio De Sousa Ribeiro, Ozan Oktay, Melissa McCradden, and Ben Glocker. A causal perspective on dataset bias in machine learning for medical imaging.Nature Machine Intelligence, 6(2):138–146, 2024. 3

  19. [19]

    Dynamic graph enhanced contrastive learning for chest x-ray report generation

    Mingjie Li, Bingqian Lin, Zicong Chen, Haokun Lin, Xi- aodan Liang, and Xiaojun Chang. Dynamic graph enhanced contrastive learning for chest x-ray report generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3334–3343, 2023. 2, 8

  20. [20]

    Unify, align and refine: Multi- level semantic alignment for radiology report generation

    Yaowei Li, Bang Yang, Xuxin Cheng, Zhihong Zhu, Hongx- iang Li, and Yuexian Zou. Unify, align and refine: Multi- level semantic alignment for radiology report generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2863–2874, 2023. 3

  21. [21]

    Exploring and distilling posterior and prior knowl- edge for radiology report generation

    Fenglin Liu, Xian Wu, Shen Ge, Wei Fan, and Yuexian Zou. Exploring and distilling posterior and prior knowl- edge for radiology report generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13753–13762, 2021. 2

  22. [22]

    Contrastive attention for automatic chest x-ray report generation.arXiv preprint arXiv:2106.06965, 2021

    Fenglin Liu, Changchang Yin, Xian Wu, Shen Ge, Yuex- ian Zou, Ping Zhang, and Xu Sun. Contrastive attention for automatic chest x-ray report generation.arXiv preprint arXiv:2106.06965, 2021. 2

  23. [23]

    Auto-encoding knowledge graph for unsupervised medical report generation.Advances in Neural Information Process- ing Systems, 34:16266–16279, 2021

    Fenglin Liu, Chenyu You, Xian Wu, Shen Ge, Xu Sun, et al. Auto-encoding knowledge graph for unsupervised medical report generation.Advances in Neural Information Process- ing Systems, 34:16266–16279, 2021. 3

  24. [24]

    In-context learning for zero-shot medical re- port generation

    Rui Liu, Mingjie Li, Shen Zhao, Ling Chen, Xiaojun Chang, and Lina Yao. In-context learning for zero-shot medical re- port generation. InProceedings of the 32nd ACM Interna- tional Conference on Multimedia, pages 8721–8730, 2024. 8

  25. [25]

    Reinforced cross-modal alignment for radiology report generation

    Han Qin and Yan Song. Reinforced cross-modal alignment for radiology report generation. InFindings of the Associa- tion for Computational Linguistics: ACL 2022, pages 448– 458, 2022. 2

  26. [26]

    Automatic radiology reports generation via memory align- ment network

    Hongyu Shen, Mingtao Pei, Juncai Liu, and Zhaoxing Tian. Automatic radiology reports generation via memory align- ment network. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4776–4783, 2024. 8

  27. [27]

    Interactive and explainable region-guided radiol- ogy report generation

    Tim Tanida, Philip M ¨uller, Georgios Kaissis, and Daniel Rueckert. Interactive and explainable region-guided radiol- ogy report generation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7433–7442, 2023. 8

  28. [28]

    Memory- based cross-modal semantic alignment network for radiology report generation.IEEE Journal of Biomedical and Health Informatics, 2024

    Yitian Tao, Liyan Ma, Jing Yu, and Han Zhang. Memory- based cross-modal semantic alignment network for radiology report generation.IEEE Journal of Biomedical and Health Informatics, 2024. 3

  29. [29]

    Xraygpt: Chest radiographs summarization using medical vision-language models

    Omkar Thawkar, Abdelrahman Shaker, Sahal Shaji Mullap- pilly, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, and Fahad Shahbaz Khan. Xraygpt: Chest radiographs summarization using medical vision- language models.arXiv preprint arXiv:2306.07971, 2023. 8

  30. [30]

    Towards gen- eralist biomedical ai.Nejm Ai, 1(3):AIoa2300138, 2024

    Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaeker- mann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards gen- eralist biomedical ai.Nejm Ai, 1(3):AIoa2300138, 2024. 8

  31. [31]

    doi:10.48550/arxiv.2206.05498 , arxivId =

    Athanasios Vlontzos, Daniel Rueckert, and Bernhard Kainz. A review of causality for learning algorithms in medical im- age analysis.arXiv preprint arXiv:2206.05498, 2022. 3

  32. [32]

    Entity, relation, and event extraction with contextualized span representations.arXiv preprint arXiv:1909.03546, 2019

    David Wadden, Ulme Wennberg, Yi Luan, and Han- naneh Hajishirzi. Entity, relation, and event extraction with contextualized span representations.arXiv preprint arXiv:1909.03546, 2019. 7

  33. [33]

    Cross-modal pro- totype driven network for radiology report generation

    Jun Wang, Abhir Bhalerao, and Yulan He. Cross-modal pro- totype driven network for radiology report generation. In European Conference on Computer Vision, pages 563–579. Springer, 2022. 3

  34. [34]

    Causal attention for unbiased visual recognition

    Tan Wang, Chang Zhou, Qianru Sun, and Hanwang Zhang. Causal attention for unbiased visual recognition. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 3091–3100, 2021. 6

  35. [35]

    Tienet: Text-image embedding net- work for common thorax disease classification and reporting in chest x-rays

    Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, and Ronald M Summers. Tienet: Text-image embedding net- work for common thorax disease classification and reporting in chest x-rays. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 9049–9058,

  36. [36]

    arXiv preprint arXiv:2108.10904 , year=

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision.arXiv preprint arXiv:2108.10904, 2021. 4

  37. [37]

    Metransformer: Radiology report generation by transformer with multiple learnable expert tokens

    Zhanyu Wang, Lingqiao Liu, Lei Wang, and Luping Zhou. Metransformer: Radiology report generation by transformer with multiple learnable expert tokens. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11558–11567, 2023. 8

  38. [38]

    R2gengpt: Radiology report generation with frozen llms

    Zhanyu Wang, Lingqiao Liu, Lei Wang, and Luping Zhou. R2gengpt: Radiology report generation with frozen llms. Meta-Radiology, 1(3):100033, 2023. 8

  39. [39]

    Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis

    Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 21372–21383, 2023. 2

  40. [40]

    Causal infer- ence in the medical domain: a survey.Applied Intelligence, pages 1–24, 2024

    Xing Wu, Shaoqi Peng, Jingwen Li, Jian Zhang, Qun Sun, Weimin Li, Quan Qian, Yue Liu, and Yike Guo. Causal infer- ence in the medical domain: a survey.Applied Intelligence, pages 1–24, 2024. 3

  41. [41]

    A survey on incorporating do- main knowledge into deep learning for medical image anal- ysis.Medical Image Analysis, 69:101985, 2021

    Xiaozheng Xie, Jianwei Niu, Xuefeng Liu, Zhengsu Chen, Shaojie Tang, and Shui Yu. A survey on incorporating do- main knowledge into deep learning for medical image anal- ysis.Medical Image Analysis, 69:101985, 2021. 3

  42. [42]

    Vision-knowledge fusion model for multi-domain medical report generation.Infor- mation Fusion, 97:101817, 2023

    Dexuan Xu, Huashi Zhu, Yu Huang, Zhi Jin, Weiping Ding, Hang Li, and Menglong Ran. Vision-knowledge fusion model for multi-domain medical report generation.Infor- mation Fusion, 97:101817, 2023. 3

  43. [43]

    Show, attend and tell: Neural image caption gen- eration with visual attention

    Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption gen- eration with visual attention. InInternational conference on machine learning, pages 2048–2057. PMLR, 2015. 5

  44. [44]

    Attributed abnor- mality graph embedding for clinically accurate x-ray report generation.IEEE Transactions on Medical Imaging, 42(8): 2211–2222, 2023

    Sixing Yan, William K Cheung, Keith Chiu, Terence M Tong, Ka Chun Cheung, and Simon See. Attributed abnor- mality graph embedding for clinically accurate x-ray report generation.IEEE Transactions on Medical Imaging, 42(8): 2211–2222, 2023. 3

  45. [45]

    Knowledge matters: Chest radiology report genera- tion with general and specific knowledge.Medical image analysis, 80:102510, 2022

    Shuxin Yang, Xian Wu, Shen Ge, S Kevin Zhou, and Li Xiao. Knowledge matters: Chest radiology report genera- tion with general and specific knowledge.Medical image analysis, 80:102510, 2022. 3

  46. [46]

    Radiology report generation with a learned knowledge base and multi-modal alignment.Med- ical Image Analysis, 86:102798, 2023

    Shuxin Yang, Xian Wu, Shen Ge, Zhuozhao Zheng, S Kevin Zhou, and Li Xiao. Radiology report generation with a learned knowledge base and multi-modal alignment.Med- ical Image Analysis, 86:102798, 2023. 3, 8

  47. [47]

    Deconfounded image captioning: A causal retrospect.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 45(11): 12996–13010, 2021

    Xu Yang, Hanwang Zhang, and Jianfei Cai. Deconfounded image captioning: A causal retrospect.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 45(11): 12996–13010, 2021. 4

  48. [48]

    Aligntransformer: Hierarchical alignment of visual regions and disease tags for medical report genera- tion

    Di You, Fenglin Liu, Shen Ge, Xiaoxia Xie, Jing Zhang, and Xian Wu. Aligntransformer: Hierarchical alignment of visual regions and disease tags for medical report genera- tion. InMedical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Pro- ceedings, Part III 2...

  49. [49]

    Anatomy-guided weakly- supervised abnormality localization in chest x-rays

    Ke Yu, Shantanu Ghosh, Zhexiong Liu, Christopher Deible, and Kayhan Batmanghelich. Anatomy-guided weakly- supervised abnormality localization in chest x-rays. InIn- ternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 658–668. Springer,

  50. [50]

    Au- tomatic radiology report generation based on multi-view image fusion and medical concept enrichment

    Jianbo Yuan, Haofu Liao, Rui Luo, and Jiebo Luo. Au- tomatic radiology report generation based on multi-view image fusion and medical concept enrichment. InMedi- cal Image Computing and Computer Assisted Intervention– MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part VI 22, pages 721–729. Springer, 2019. 3

  51. [51]

    When radiology report generation meets knowledge graph

    Yixiao Zhang, Xiaosong Wang, Ziyue Xu, Qihang Yu, Alan Yuille, and Daguang Xu. When radiology report generation meets knowledge graph. InProceedings of the AAAI con- ference on artificial intelligence, pages 12910–12917, 2020. 3