pith. machine review for the scientific record.

arxiv: 2604.27559 · v1 · submitted 2026-04-30 · 💻 cs.CV · cs.AI

Recognition: unknown

RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 09:54 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords radiology report generation · hierarchical alignment · optimal transport · chest X-ray · cross-modal alignment · transformer · clinical efficacy · natural language generation

The pith

RIHA generates more accurate radiology reports by aligning image features with report structure at paragraph, sentence, and word levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RIHA, a transformer-based framework for generating radiology reports from medical images that tackles fine-grained alignment between visual features and the structured hierarchies of long clinical texts. Prior methods often flatten reports into simple sequences, which loses the paragraph-to-sentence-to-word organization essential for capturing diagnostic nuances. RIHA counters this with visual and text feature pyramids whose levels are matched via optimal transport, and with relative positional encoding in the decoder for tighter token alignment. On the IU-Xray and MIMIC-CXR chest X-ray benchmarks it records gains over prior models in both language-quality scores and clinical-correctness measures. A reader would care because better automated reports could cut radiologist workload while lowering the chance of missed findings.

Core claim

RIHA is an end-to-end framework that performs multi-level alignment between radiological images and their corresponding reports across paragraph, sentence, and word levels. It introduces a Visual Feature Pyramid to extract multi-scale visual features and a Text Feature Pyramid to represent multi-granularity textual structures, integrated via a Cross-modal Hierarchical Alignment module that uses optimal transport. Relative Positional Encoding is added to the decoder to enhance token-level alignment, leading to superior performance on benchmark datasets.
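For orientation, the entropic optimal-transport problem that alignment modules of this kind typically solve has a standard form (after Cuturi [63] and Villani [67]; the paper's exact cost and marginals may differ):

    \min_{T \in U(a,b)} \langle T, C \rangle - \varepsilon H(T), \qquad U(a,b) = \{\, T \in \mathbb{R}_{+}^{m \times n} : T \mathbf{1}_n = a, \; T^{\top} \mathbf{1}_m = b \,\}

Here C_{ij} is the cost of matching visual feature i to text unit j (for instance, one minus their cosine similarity), a and b are marginal weights over the m visual and n textual units at a given pyramid level, H(T) is the entropy of the transport plan T, and ε trades alignment sharpness against smoothness.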

What carries the argument

The Cross-modal Hierarchical Alignment module, which leverages optimal transport to align multi-scale visual features from the Visual Feature Pyramid with multi-granularity textual structures from the Text Feature Pyramid at paragraph, sentence, and word levels.
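How such a module could compute a per-level alignment is sketched below. This is a minimal illustration, not the paper's implementation: the uniform marginals, cosine cost, and tensor shapes are assumptions, and the Sinkhorn iteration follows the standard entropic-OT recipe of Cuturi [63].

```python
import math
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost: torch.Tensor, eps: float = 0.1, n_iters: int = 50) -> torch.Tensor:
    """Entropic OT plan between uniform marginals, computed in log space.

    cost: (m, n) matching costs between m visual and n textual units at one
    pyramid level. Returns an (m, n) plan whose rows sum to 1/m and whose
    columns sum to 1/n.
    """
    m, n = cost.shape
    log_a = torch.full((m,), -math.log(m))   # log of uniform row marginals
    log_b = torch.full((n,), -math.log(n))   # log of uniform column marginals
    log_K = -cost / eps                      # Gibbs kernel in the log domain
    log_u = torch.zeros(m)
    log_v = torch.zeros(n)
    for _ in range(n_iters):                 # alternating marginal projections
        log_u = log_a - torch.logsumexp(log_K + log_v[None, :], dim=1)
        log_v = log_b - torch.logsumexp(log_K + log_u[:, None], dim=0)
    return (log_u[:, None] + log_K + log_v[None, :]).exp()

def hierarchical_alignment_cost(visual_pyramid, text_pyramid, eps: float = 0.1):
    """Sum OT costs over (paragraph, sentence, word) level pairs.

    Each pyramid is a list of (num_units, dim) tensors, one per granularity,
    assumed already projected into a shared embedding space.
    """
    total = torch.tensor(0.0)
    for v_feats, t_feats in zip(visual_pyramid, text_pyramid):
        v = F.normalize(v_feats, dim=-1)
        t = F.normalize(t_feats, dim=-1)
        cost = 1.0 - v @ t.T                 # cosine distance as matching cost
        plan = sinkhorn_plan(cost, eps)
        total = total + (plan * cost).sum()  # transport cost <T, C> at this level
    return total
```

The per-level plans could also re-weight features before decoding; summing transport costs across levels is one plausible way to turn hierarchical alignment into a single training signal.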

If this is right

  • Outperforms existing state-of-the-art models in natural language generation metrics on IU-Xray and MIMIC-CXR
  • Records higher clinical efficacy metric scores on the same two datasets
  • Captures nuanced semantics in clinical narratives through precise cross-modal mapping
  • Strengthens token-level alignment between visual features and generated text via relative positional encoding

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same multi-level alignment idea could be tested on report generation from CT or MRI volumes where spatial hierarchies are even more pronounced
  • Embedding such a model in clinical workflows might shorten report turnaround times by surfacing structured findings automatically
  • The emphasis on report structure implies that hierarchical alignment techniques may transfer to other image-to-long-text tasks such as pathology slide captioning

Load-bearing premise

That multi-level feature pyramids combined with optimal transport will produce clinically faithful alignments without introducing or omitting critical diagnostic details that flat-sequence models already handle adequately.

What would settle it

A side-by-side review of generated reports on test cases containing subtle or multiple findings, checking whether RIHA omits or fabricates a key abnormality that a non-hierarchical baseline correctly includes or excludes.

Figures

Figures reproduced from arXiv: 2604.27559 by Conghao Xiong, Si Yong Yeo, Xulei Yang, Yang Yu, Yucheng Chen, Yufei Shi.

Figure 1
Figure 1: A chest X-ray image alongside its corresponding radiology report. The boxes on the right display, respectively, the individual sentences and the extracted keywords from the report. Multi-level visual-textual alignments are indicated using matching colours to highlight their associations. view at source ↗
Figure 2
Figure 2: The architecture of RIHA: an image is fed into the VFP Extractor to obtain shallow, middle, and high-level features, while paragraph-, sentence-, and word-level text features are extracted by the TFP Extractor. The multi-granularity visual and textual features are then sent into CHA for hierarchical alignment. After that, refined visual and textual features are fed into a transformer … view at source ↗
Figure 4
Figure 4: An illustration comparing relative and absolute positional embeddings in transformers, where the clipped value k = 3 represents the maximum allowable relative position distance. a) Absolute position embedding weights. b) Relative position embedding weights. c) The transformer encoder structure with relative position embeddings. For further details, see [68]. view at source ↗
Figure 5
Figure 5: Examples of generated reports from the MIMIC-CXR testing subset using the baseline model and the proposed RIHA method. Identical findings in the ground truth (GT) and generated reports are highlighted with matching colors, demonstrating the superior performance of the approach. view at source ↗
Figure 6
Figure 6: Example reports generated by incrementally incorporating the proposed modules into the baseline model. Key medical terms are highlighted in different colors to clearly differentiate model performance. view at source ↗
Figure 7
Figure 7: Attention map visualizations for various keywords from the Baseline and RIHA models reveal that RIHA assigns more precise attention regions, highlighting its improved focus for each keyword. view at source ↗
Original abstract

Radiology report generation (RRG) has emerged as a promising approach to alleviate radiologists' workload and reduce human errors by automatically generating diagnostic reports from medical images. A key challenge in RRG is achieving fine-grained alignment between complex visual features and the hierarchical structure of long-form radiology reports. Although recent methods have improved image-text representation learning, they often treat reports as flat sequences, overlooking their structured sections and semantic hierarchies. This simplification hinders precise cross-modal alignment and weakens RRG accuracy. To address this challenge, we propose RIHA (Report-Image Hierarchical Alignment Transformer), a novel end-to-end framework that performs multi-level alignment between radiological images and their corresponding reports across paragraph, sentence, and word levels. This hierarchical alignment enables more precise cross-modal mapping, essential for capturing the nuanced semantics embedded in clinical narratives. Specifically, RIHA introduces a Visual Feature Pyramid (VFP) to extract multi-scale visual features and a Text Feature Pyramid (TFP) to represent multi-granularity textual structures. These components are integrated through a Cross-modal Hierarchical Alignment (CHA) module, leveraging optimal transport to effectively align visual and textual features across various levels. Furthermore, we incorporate Relative Positional Encoding (RPE) into the decoder to model spatial and semantic relationships among tokens, enhancing the token-level alignment between visual features and generated text. Extensive experiments on two benchmark chest X-ray datasets, IU-Xray and MIMIC-CXR, demonstrate that RIHA outperforms existing state-of-the-art models in both natural language generation and clinical efficacy metrics.
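Because both the abstract and Figure 4 lean on clipped relative positions, here is a minimal sketch of a clipped relative-position attention bias in the spirit of Shaw et al. [68]. The class name and shapes are assumptions for illustration, not the paper's actual decoder:

```python
import torch
import torch.nn as nn

class ClippedRelativeBias(nn.Module):
    """Learned attention bias indexed by clipped relative distance.

    Offsets j - i are clipped to [-k, k] (Figure 4 uses k = 3), so all token
    pairs farther apart than k share a single learned embedding.
    """
    def __init__(self, k: int = 3):
        super().__init__()
        self.k = k
        self.bias = nn.Embedding(2 * k + 1, 1)      # one scalar per clipped offset

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]           # (L, L) offsets j - i
        rel = rel.clamp(-self.k, self.k) + self.k   # shift into [0, 2k] for lookup
        return self.bias(rel).squeeze(-1)           # (L, L) additive bias

# Added to scaled dot-product attention logits before softmax, e.g.:
# scores = q @ k_mat.transpose(-2, -1) / d ** 0.5 + ClippedRelativeBias(k=3)(L)
```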

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RIHA, a transformer-based framework for radiology report generation that extracts multi-scale visual features via a Visual Feature Pyramid (VFP), multi-granularity textual features via a Text Feature Pyramid (TFP), and aligns them across paragraph/sentence/word levels using a Cross-modal Hierarchical Alignment (CHA) module based on optimal transport; it further adds Relative Positional Encoding in the decoder. Experiments on IU-Xray and MIMIC-CXR are reported to show gains over prior SOTA in both NLG metrics and clinical efficacy metrics.

Significance. If the hierarchical alignment demonstrably improves clinical fidelity without introducing omissions, the work would usefully extend representation learning for structured medical text generation. The integration of optimal transport across explicit pyramid levels is a concrete technical step beyond flat-sequence baselines.

major comments (2)
  1. [CHA module and Experiments] The central claim that VFP+TFP+CHA yields clinically faithful alignments superior to flat-sequence models rests on aggregate NLG and clinical metrics; however, the manuscript provides no entity-level error analysis (e.g., per-finding omission rates for lesions, negations, or rare descriptors) to rule out the possibility that global optimal transport improves averages while still dropping sparse high-stakes details that flat models already capture.
  2. [Experiments] The experimental section must report baseline re-implementations, statistical significance tests, error bars across runs, and explicit checks for data leakage between train/test splits on both IU-Xray and MIMIC-CXR; without these, the reported outperformance cannot be taken as load-bearing evidence for the hierarchical design.
minor comments (2)
  1. [Method] Notation for the three pyramid levels (paragraph/sentence/word) and the precise formulation of the optimal-transport cost matrix should be introduced with a single equation block rather than scattered prose.
  2. [Figures] Figure captions for the overall architecture and feature-pyramid diagrams should explicitly label the input/output tensors at each level to aid reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the need for more granular validation of our hierarchical alignment claims and stronger experimental protocols. We address each major comment below and will revise the manuscript to incorporate the requested elements where feasible.

Point-by-point responses
  1. Referee: [CHA module and Experiments] The central claim that VFP+TFP+CHA yields clinically faithful alignments superior to flat-sequence models rests on aggregate NLG and clinical metrics; however, the manuscript provides no entity-level error analysis (e.g., per-finding omission rates for lesions, negations, or rare descriptors) to rule out the possibility that global optimal transport improves averages while still dropping sparse high-stakes details that flat models already capture.

    Authors: We agree that aggregate metrics alone leave open the possibility of overlooking omissions in sparse but clinically critical elements. In the revised manuscript we will add an entity-level error analysis section that reports per-finding omission rates for lesions, negations, and rare descriptors on both IU-Xray and MIMIC-CXR, directly comparing RIHA against the strongest flat-sequence baselines to quantify whether the hierarchical optimal-transport alignment reduces such omissions (a sketch of such a computation follows the responses below). revision: yes

  2. Referee: [Experiments] The experimental section must report baseline re-implementations, statistical significance tests, error bars across runs, and explicit checks for data leakage between train/test splits on both IU-Xray and MIMIC-CXR; without these, the reported outperformance cannot be taken as load-bearing evidence for the hierarchical design.

    Authors: We acknowledge that these experimental details are necessary for the results to serve as load-bearing evidence. The revised experimental section will include: (i) re-implementations of the primary baselines using the same training protocols, (ii) statistical significance tests (paired t-tests with p-values) on the NLG and clinical metrics, (iii) error bars computed over multiple random seeds, and (iv) explicit data-leakage verification confirming that no patient or study overlap exists between train and test splits on either dataset. revision: yes
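To make the first response concrete, a per-finding omission rate can be computed from CheXpert-style labels [9] roughly as follows. This is a minimal sketch with hypothetical inputs, assuming a labeler that maps each report to a set of positive findings; it is not code from the paper.

```python
from collections import Counter

def per_finding_omission_rates(gt_findings, gen_findings):
    """Fraction of reports in which each ground-truth finding is absent
    from the corresponding generated report.

    gt_findings / gen_findings: parallel lists of per-report sets of positive
    finding labels, e.g. from a CheXpert-style labeler.
    """
    present = Counter()  # reports in which the finding appears in ground truth
    omitted = Counter()  # of those, reports where generation drops it
    for gt, gen in zip(gt_findings, gen_findings):
        for finding in gt:
            present[finding] += 1
            if finding not in gen:
                omitted[finding] += 1
    return {f: omitted[f] / present[f] for f in present}

# "pleural effusion" is dropped in 1 of the 2 reports containing it -> rate 0.5
rates = per_finding_omission_rates(
    [{"cardiomegaly", "pleural effusion"}, {"pleural effusion"}],
    [{"cardiomegaly"}, {"pleural effusion"}],
)
```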

Circularity Check

0 steps flagged

No circularity: RIHA's multi-level alignment is an independent architectural proposal evaluated empirically

full rationale

The paper introduces RIHA as a novel end-to-end framework combining Visual Feature Pyramid (VFP), Text Feature Pyramid (TFP), Cross-modal Hierarchical Alignment (CHA) via optimal transport, and Relative Positional Encoding (RPE). Performance claims rest on standard benchmark evaluations (IU-Xray, MIMIC-CXR) for NLG and clinical metrics, with no equations or derivations that reduce predictions to fitted inputs, self-definitions, or self-citation chains. The hierarchical alignment is motivated externally by limitations of flat-sequence models rather than being tautological with its own components. No load-bearing step equates any result to its construction inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on standard deep-learning assumptions plus the domain-specific claim that report hierarchy can be captured by feature pyramids and aligned via optimal transport; no new physical entities are introduced.

free parameters (2)
  • Feature pyramid scales and dimensions
    The number of levels and channel sizes in VFP and TFP are architectural choices tuned on validation data.
  • Optimal transport regularization parameters
    Parameters controlling the transport cost and entropy regularization are learned or selected during training.
axioms (2)
  • domain assumption Optimal transport provides an effective mechanism for cross-modal alignment at multiple granularities
    Invoked in the CHA module description without further justification in the abstract.
  • domain assumption Relative positional encoding improves token-level alignment between visual features and generated text
    Stated as an enhancement in the decoder without derivation.

pith-pipeline@v0.9.0 · 5590 in / 1402 out tokens · 50397 ms · 2026-05-07T09:54:46.644493+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

85 extracted references · 13 canonical work pages

  1. [1] K. Konstantinidis, “The shortage of radiographers: A global crisis in healthcare,” Journal of Medical Imaging and Radiation Sciences, vol. 55, no. 4, p. 101333, 2024.
  2. [2] W. Chen, L. Shen, J. Lin, J. Luo, X. Li, and Y. Yuan, “Fine-grained image-text alignment in medical imaging enables explainable cyclic image-report generation,” arXiv preprint arXiv:2312.08078, 2023.
  3. [3] G. Reale-Nosei, E. Amador-Domínguez, and E. Serrano, “From vision to text: A comprehensive review of natural image captioning in medical diagnosis and radiology report generation,” Medical Image Analysis, p. 103264, 2024.
  4. [4] I. Hartsock and G. Rasool, “Vision-language models for medical report generation and visual question answering: A review,” Frontiers in Artificial Intelligence, vol. 7, p. 1430984, 2024.
  5. [5] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning. PMLR, 2015, pp. 2048–2057.
  6. [6] B. Jing, P. Xie, and E. Xing, “On the automatic generation of medical imaging reports,” arXiv preprint arXiv:1711.08195, 2017.
  7. [7] Z. Li, L. T. Yang, B. Ren, X. Nie, Z. Gao, C. Tan, and S. Z. Li, “MLIP: Enhancing medical visual representation with divergence encoder and knowledge-guided contrastive learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11704–11714.
  8. [8] M. Chatterjee and A. G. Schwing, “Diverse and coherent paragraph generation from images,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 729–744.
  9. [9] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya et al., “CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 590–597.
  10. [10] J. Wang, A. Bhalerao, and Y. He, “Cross-modal prototype driven network for radiology report generation,” in European Conference on Computer Vision. Springer, 2022, pp. 563–579.
  11. [11] Z. Chen, Y. Song, T.-H. Chang, and X. Wan, “Generating radiology reports via memory-driven transformer,” arXiv preprint arXiv:2010.16056, 2020.
  12. [12] J. Wang, A. Bhalerao, T. Yin, S. See, and Y. He, “CAMANet: Class activation map guided attention network for radiology report generation,” IEEE Journal of Biomedical and Health Informatics, vol. 28, no. 4, pp. 2199–2210, 2024.
  13. [13] Z. Zhang, Y. Yu, Y. Chen, X. Yang, and S. Y. Yeo, “MedUnifier: Unifying vision-and-language pre-training on medical data with vision generation task using discrete visual representations,” arXiv preprint arXiv:2503.01019, 2025.
  14. [14] Z. Chen, Y. Shen, Y. Song, and X. Wan, “Cross-modal memory networks for radiology report generation,” arXiv preprint arXiv:2204.13258, 2022.
  15. [15] M. Li, B. Lin, Z. Chen, H. Lin, X. Liang, and X. Chang, “Dynamic graph enhanced contrastive learning for chest X-ray report generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3334–3343.
  16. [16] Z. Huang, X. Zhang, and S. Zhang, “KiUT: Knowledge-injected U-transformer for radiology report generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19809–19818.
  17. [17] M. Li, W. Cai, K. Verspoor, S. Pan, X. Liang, and X. Chang, “Cross-modal clinical graph transformer for ophthalmic report generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20656–20665.
  18. [18] H. Zhao, J. Chen, L. Huang, T. Yang, W. Ding, and C. Li, “Automatic generation of medical report with knowledge graph,” in Proceedings of the 2021 10th International Conference on Computing and Pattern Recognition, 2021, pp. 1–1.
  19. [19] Y. Zhang, X. Wang, Z. Xu, Q. Yu, A. Yuille, and D. Xu, “When radiology report generation meets knowledge graph,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12910–12917.
  20. [20] S. Yang, X. Wu, S. Ge, Z. Zheng, S. K. Zhou, and L. Xiao, “Radiology report generation with a learned knowledge base and multi-modal alignment,” Medical Image Analysis, vol. 86, p. 102798, 2023.
  21. [21] S. Yang, X. Wu, S. Ge, S. K. Zhou, and L. Xiao, “Knowledge matters: Chest radiology report generation with general and specific knowledge,” Medical Image Analysis, vol. 80, p. 102510, 2022.
  22. [22] Y. Li, B. Yang, X. Cheng, Z. Zhu, H. Li, and Y. Zou, “Unify, align and refine: Multi-level semantic alignment for radiology report generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2863–2874.
  23. [23] A. Liu, Y. Guo, J.-h. Yong, and F. Xu, “Multi-grained radiology report generation with sentence-level image-language contrastive learning,” IEEE Transactions on Medical Imaging, vol. 43, no. 7, pp. 2657–2669, 2024.
  24. [24] S. Y. Yeo, X. Xie, I. Sazonov, and P. Nithiarasu, “Level set segmentation with robust image gradient energy and statistical shape prior,” in 2011 18th IEEE International Conference on Image Processing. IEEE, 2011, pp. 3397–3400.
  25. [25] X. Yang, Y. Su, R. Duan, H. Fan, S. Y. Yeo, C. Lim, L. Zhong, and R. S. Tan, “Cardiac image segmentation by random walks with dynamic shape constraint,” IET Computer Vision, vol. 10, no. 1, pp. 79–86, 2016.
  26. [26] X. Yang, Z. Zeng, S. Y. Yeo, C. Tan, H. L. Tey, and Y. Su, “A novel multi-task deep learning model for skin lesion segmentation and classification,” arXiv preprint arXiv:1703.01025, 2017.
  27. [27] J. Liu, S. Y. Tan, X. Yang, Y. Xu, and S. Y. Yeo, “Effdnet: A scribble-supervised medical image segmentation method with enhanced foreground feature discrimination,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2025, pp. 194–204.
  28. [28] J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing when to look: Adaptive attention via a visual sentinel for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 375–383.
  29. [29] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.
  30. [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  31. [31] S. Santhanam, “Context based text-generation using LSTM networks,” arXiv preprint arXiv:2005.00048, 2020.
  32. [32] M. Sirshar, M. F. K. Paracha, M. U. Akram, N. S. Alghamdi, S. Z. Y. Zaidi, and T. Fatima, “Attention based automated radiology report generation using CNN and LSTM,” PLOS ONE, vol. 17, no. 1, p. e0262209, 2022.
  33. [33] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7008–7024.
  34. [34] B. Dai, S. Fidler, R. Urtasun, and D. Lin, “Towards diverse and natural image descriptions via a conditional GAN,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2970–2979.
  35. [35] J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12888–12900.
  36. [36] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei et al., “Oscar: Object-semantics aligned pre-training for vision-language tasks,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX. Springer, 2020, pp. 121–137.
  37. [37]
  38. [38] Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao, “SimVLM: Simple visual language model pretraining with weak supervision,” arXiv preprint arXiv:2108.10904, 2021.
  39. [39] X. Hou, Y. Luo, W. Song, Y. Guo, W. You, and S. Li, “Radiographic reports generation via retrieval enhanced cross-modal fusion,” in 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2024, pp. 2032–2039.
  40. [40] Y. Yang, X. You, K. Zhang, Z. Fu, X. Wang, J. Ding, J. Sun, Z. Yu, Q. Huang, W. Han et al., “Spatio-temporal and retrieval-augmented modelling for chest X-ray report generation,” IEEE Transactions on Medical Imaging, 2025.
  41. [41] X. Wang, Y. Peng, L. Lu, Z. Lu, and R. M. Summers, “TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9049–9058.
  42. [42] S. Li, P. Qiao, L. Wang, M. Ning, L. Yuan, Y. Zheng, and J. Chen, “An organ-aware diagnosis framework for radiology report generation,” IEEE Transactions on Medical Imaging, vol. 43, no. 12, pp. 4253–4265, 2024.
  43. [43] Y. Tian, F. Xia, and Y. Song, “Diffusion networks with task-specific noise control for radiology report generation,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 1771–1780.
  44. [44] Y. Jin, W. Chen, Y. Tian, Y. Song, and C. Yan, “Improving radiology report generation with multi-grained abnormality prediction,” Neurocomputing, vol. 600, p. 128122, 2024.
  45. [45] F. Wang, Y. Zhou, S. Wang, V. Vardhanabhuti, and L. Yu, “Multi-granularity cross-modal alignment for generalized medical visual representation learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 33536–33549, 2022.
  46. [46] T. Zhu, Q. Liu, F. Wang, Z. Tu, and M. Chen, “Unraveling cross-modality knowledge conflicts in large vision-language models,” arXiv preprint arXiv:2410.03659, 2024.
  47. [47] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao, “CLIP-Adapter: Better vision-language models with feature adapters,” International Journal of Computer Vision, vol. 132, no. 2, pp. 581–595, 2024.
  48. [48] Y. Xia, H. Huang, J. Zhu, and Z. Zhao, “Achieving cross modal generalization with multimodal unified representation,” Advances in Neural Information Processing Systems, vol. 36, pp. 63529–63541, 2023.
  49. [49] Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” in Proceedings of the Conference. Association for Computational Linguistics. Meeting, vol. 2019, 2019, p. 6558.
  50. [50] D. Yang, S. Huang, H. Kuang, Y. Du, and L. Zhang, “Disentangled representation learning for multimodal emotion recognition,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 1642–1651.
  51. [51] Z. Liu, B. Zhou, D. Chu, Y. Sun, and L. Meng, “Modality translation-based multimodal sentiment analysis under uncertain missing modalities,” Information Fusion, vol. 101, p. 101973, 2024.
  52. [52] J. Tian, K. Wang, X. Xu, Z. Cao, F. Shen, and H. T. Shen, “Multimodal disentanglement variational autoencoders for zero-shot cross-modal retrieval,” in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 960–969.
  53. [53] Y. Zeng, W. Yan, S. Mai, and H. Hu, “Disentanglement translation network for multimodal sentiment analysis,” Information Fusion, vol. 102, p. 102031, 2024.
  54. [54] Y. Li, Y. Wang, and Z. Cui, “Decoupled multimodal distilling for emotion recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6631–6640.
  55. [55] M. Li, D. Yang, Y. Lei, S. Wang, S. Wang, L. Su, K. Yang, Y. Wang, M. Sun, and L. Zhang, “A unified self-distillation framework for multimodal sentiment analysis with uncertain missing modalities,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 9, 2024, pp. 10074–10082.
  56. [56] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio, “Learning with a Wasserstein loss,” Advances in Neural Information Processing Systems, vol. 28, 2015.
  57. [57] L. Yang, J. Liu, S. Hong, Z. Zhang, Z. Huang, Z. Cai, W. Zhang, and B. Cui, “Improving diffusion-based image synthesis with context prediction,” Advances in Neural Information Processing Systems, vol. 36, pp. 37636–37656, 2023.
  58. [58] F. L. Hitchcock, “The distribution of a product from several sources to numerous localities,” Journal of Mathematics and Physics, vol. 20, no. 1-4, pp. 224–230, 1941.
  59. [59] A. R. Lahitani, A. E. Permanasari, and N. A. Setiawan, “Cosine similarity to determine similarity measure: Study case in online essay assessment,” in 2016 4th International Conference on Cyber and IT Service Management. IEEE, 2016, pp. 1–6.
  60. [60] S. Kullback, “Kullback-Leibler divergence,” 1951.
  61. [61] J.-D. Rolle, “Various issues around the L1-norm distance,” arXiv preprint arXiv:2110.04787, 2021.
  62. [62] I. Dokmanic, R. Parhizkar, J. Ranieri, and M. Vetterli, “Euclidean distance matrices: Essential theory, algorithms, and applications,” IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 12–30, 2015.
  63. [63] M. Cuturi, “Lightspeed computation of optimal transportation distances,” Advances in Neural Information Processing Systems, vol. 26, no. 2, pp. 2292–2300, 2013.
  64. [64] H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3D object reconstruction from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 605–613.
  65. [65] R. Jonker and T. Volgenant, “A shortest augmenting path algorithm for dense and sparse linear assignment problems,” in DGOR/NSOR: Papers of the 16th Annual Meeting of DGOR in Cooperation with NSOR / Vorträge der 16. Jahrestagung der DGOR zusammen mit der NSOR. Springer, 1988, pp. 622–622.
  66. [66] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.
  67. [67] C. Villani et al., Optimal Transport: Old and New. Springer, 2008, vol. 338.
  68. [68] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” arXiv preprint arXiv:1803.02155, 2018.
  69. [69] K. Wu, H. Peng, M. Chen, J. Fu, and H. Chao, “Rethinking and improving relative position encoding for vision transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10033–10041.
  70. [70] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald, “Preparing a collection of radiology examinations for distribution and retrieval,” Journal of the American Medical Informatics Association, vol. 23, no. 2, pp. 304–310, 2016.
  71. [71] Y. Li, X. Liang, Z. Hu, and E. P. Xing, “Hybrid retrieval-generation reinforced agent for medical image report generation,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  72. [72] F. Liu, X. Wu, S. Ge, W. Fan, and Y. Zou, “Exploring and distilling posterior and prior knowledge for radiology report generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13753–13762.
  73. [73] A. E. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, and S. Horng, “MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs,” arXiv preprint arXiv:1901.07042, 2019.
  74. [74] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
  75. [75] M. Denkowski and A. Lavie, “Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems,” in Proceedings of the Sixth Workshop on Statistical Machine Translation, 2011, pp. 85–91.
  76. [76] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out, 2004, pp. 74–81.
  77. [77] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
  78. [78] F. Liu, S. Ge, Y. Zou, and X. Wu, “Competence-based multimodal curriculum learning for medical report generation,” arXiv preprint arXiv:2206.14579, 2022.
  79. [79] T. Tanida, P. Müller, G. Kaissis, and D. Rueckert, “Interactive and explainable region-guided radiology report generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7433–7442.
  80. [80] H. Jin, H. Che, Y. Lin, and H. Chen, “PromptMRG: Diagnosis-driven prompts for medical report generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2607–2615.

Showing first 80 references.