RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation
Pith reviewed 2026-05-07 09:54 UTC · model grok-4.3
The pith
RIHA generates more accurate radiology reports by aligning image features with report structure at paragraph, sentence, and word levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RIHA is an end-to-end framework that performs multi-level alignment between radiological images and their corresponding reports across paragraph, sentence, and word levels. It introduces a Visual Feature Pyramid to extract multi-scale visual features and a Text Feature Pyramid to represent multi-granularity textual structures, integrated via a Cross-modal Hierarchical Alignment module that uses optimal transport. Relative Positional Encoding is added to the decoder to strengthen token-level alignment; the authors report superior performance on IU-Xray and MIMIC-CXR.
What carries the argument
The Cross-modal Hierarchical Alignment module, which leverages optimal transport to align multi-scale visual features from the Visual Feature Pyramid with multi-granularity textual structures from the Text Feature Pyramid at paragraph, sentence, and word levels.
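The paper's exact CHA formulation is not reproduced on this page. As a rough sketch of the underlying mechanism, entropically regularized optimal transport (Sinkhorn iterations) between one visual pyramid level and one textual granularity could look like the following; the cosine cost, uniform marginals, and feature shapes are illustrative assumptions, not the paper's design:

```python
import numpy as np

def sinkhorn_alignment(visual, textual, eps=0.1, n_iters=100):
    """Align visual features to text-unit features via entropic OT.

    visual:  (m, d) array of L2-normalized visual features at one pyramid level.
    textual: (n, d) array of L2-normalized textual features at one granularity.
    Returns the (m, n) transport plan; high mass = strong cross-modal match.
    """
    # Cost = 1 - cosine similarity (features assumed pre-normalized).
    cost = 1.0 - visual @ textual.T
    K = np.exp(-cost / eps)                                  # Gibbs kernel
    a = np.full(visual.shape[0], 1.0 / visual.shape[0])      # uniform marginals
    b = np.full(textual.shape[0], 1.0 / textual.shape[0])
    u = np.ones_like(a)
    for _ in range(n_iters):                                 # Sinkhorn updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]                       # diag(u) K diag(v)

# Hypothetical usage: 49 image patches vs. 12 report sentences, 256-dim features.
rng = np.random.default_rng(0)
V = rng.normal(size=(49, 256)); V /= np.linalg.norm(V, axis=1, keepdims=True)
T = rng.normal(size=(12, 256)); T /= np.linalg.norm(T, axis=1, keepdims=True)
plan = sinkhorn_alignment(V, T)
print(plan.shape)
```

The transport plan's rows can then be read as soft assignments of each visual region to textual units, which is the general shape of OT-based cross-modal alignment the abstract describes.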
If this is right
- Outperforms existing state-of-the-art models in natural language generation metrics on IU-Xray and MIMIC-CXR
- Records higher clinical efficacy metric scores on the same two datasets
- Captures nuanced semantics in clinical narratives through precise cross-modal mapping
- Strengthens token-level alignment between visual features and generated text via relative positional encoding
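The decoder's Relative Positional Encoding is described only at a high level. A generic Shaw-style relative-position bias for one attention head, offered as an illustrative sketch rather than the paper's exact formulation:

```python
import numpy as np

def attention_with_relative_bias(Q, K, V, rel_bias, max_dist=8):
    """Scaled dot-product attention plus a relative-position bias.

    Q, K, V:  (n, d) query/key/value matrices for one head.
    rel_bias: (2*max_dist + 1,) learnable scalars, one per clipped offset.
    Biases each logit by the clipped signed distance between query and key
    positions, in the spirit of Shaw et al. (2018).
    """
    n, d = Q.shape
    logits = Q @ K.T / np.sqrt(d)
    # Signed offset j - i, clipped to [-max_dist, max_dist], shifted to an index.
    idx = np.arange(n)
    offsets = np.clip(idx[None, :] - idx[:, None], -max_dist, max_dist) + max_dist
    logits = logits + rel_bias[offsets]
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ V

# Hypothetical usage on a toy 10-token sequence with 16-dim heads.
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(10, 16)) for _ in range(3))
bias = rng.normal(scale=0.1, size=17)    # 2*8 + 1 possible offsets
out = attention_with_relative_bias(Q, K, V, bias)
print(out.shape)
```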
Where Pith is reading between the lines
- The same multi-level alignment idea could be tested on report generation from CT or MRI volumes where spatial hierarchies are even more pronounced
- Embedding such a model in clinical workflows might shorten report turnaround times by surfacing structured findings automatically
- The emphasis on report structure implies that hierarchical alignment techniques may transfer to other image-to-long-text tasks such as pathology slide captioning
Load-bearing premise
That multi-level feature pyramids combined with optimal transport will produce clinically faithful alignments without introducing or omitting critical diagnostic details that flat-sequence models already handle adequately.
What would settle it
A side-by-side review of generated reports on test cases containing subtle or multiple findings, checking whether RIHA omits or fabricates a key abnormality that a non-hierarchical baseline correctly includes or excludes.
Original abstract
Radiology report generation (RRG) has emerged as a promising approach to alleviate radiologists' workload and reduce human errors by automatically generating diagnostic reports from medical images. A key challenge in RRG is achieving fine-grained alignment between complex visual features and the hierarchical structure of long-form radiology reports. Although recent methods have improved image-text representation learning, they often treat reports as flat sequences, overlooking their structured sections and semantic hierarchies. This simplification hinders precise cross-modal alignment and weakens RRG accuracy. To address this challenge, we propose RIHA (Report-Image Hierarchical Alignment Transformer), a novel end-to-end framework that performs multi-level alignment between radiological images and their corresponding reports across paragraph, sentence, and word levels. This hierarchical alignment enables more precise cross-modal mapping, essential for capturing the nuanced semantics embedded in clinical narratives. Specifically, RIHA introduces a Visual Feature Pyramid (VFP) to extract multi-scale visual features and a Text Feature Pyramid (TFP) to represent multi-granularity textual structures. These components are integrated through a Cross-modal Hierarchical Alignment (CHA) module, leveraging optimal transport to effectively align visual and textual features across various levels. Furthermore, we incorporate Relative Positional Encoding (RPE) into the decoder to model spatial and semantic relationships among tokens, enhancing the token-level alignment between visual features and generated text. Extensive experiments on two benchmark chest X-ray datasets, IU-Xray and MIMIC-CXR, demonstrate that RIHA outperforms existing state-of-the-art models in both natural language generation and clinical efficacy metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RIHA, a transformer-based framework for radiology report generation that extracts multi-scale visual features via a Visual Feature Pyramid (VFP), multi-granularity textual features via a Text Feature Pyramid (TFP), and aligns them across paragraph/sentence/word levels using a Cross-modal Hierarchical Alignment (CHA) module based on optimal transport; it further adds Relative Positional Encoding in the decoder. Experiments on IU-Xray and MIMIC-CXR are reported to show gains over prior SOTA in both NLG metrics and clinical efficacy metrics.
Significance. If the hierarchical alignment demonstrably improves clinical fidelity without introducing omissions, the work would usefully extend representation learning for structured medical text generation. The integration of optimal transport across explicit pyramid levels is a concrete technical step beyond flat-sequence baselines.
major comments (2)
- [CHA module and Experiments] The central claim that VFP+TFP+CHA yields clinically faithful alignments superior to flat-sequence models rests on aggregate NLG and clinical metrics; however, the manuscript provides no entity-level error analysis (e.g., per-finding omission rates for lesions, negations, or rare descriptors) to rule out the possibility that global optimal transport improves averages while still dropping sparse high-stakes details that flat models already capture.
- [Experiments] The experimental section must report baseline re-implementations, statistical significance tests, error bars across runs, and explicit checks for data leakage between train/test splits on both IU-Xray and MIMIC-CXR; without these, the reported outperformance cannot be taken as load-bearing evidence for the hierarchical design.
minor comments (2)
- [Method] Notation for the three pyramid levels (paragraph/sentence/word) and the precise formulation of the optimal-transport cost matrix should be introduced with a single equation block rather than scattered prose.
- [Figures] Figure captions for the overall architecture and feature-pyramid diagrams should explicitly label the input/output tensors at each level to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify the need for more granular validation of our hierarchical alignment claims and stronger experimental protocols. We address each major comment below and will revise the manuscript to incorporate the requested elements where feasible.
Point-by-point responses
Referee: [CHA module and Experiments] The central claim that VFP+TFP+CHA yields clinically faithful alignments superior to flat-sequence models rests on aggregate NLG and clinical metrics; however, the manuscript provides no entity-level error analysis (e.g., per-finding omission rates for lesions, negations, or rare descriptors) to rule out the possibility that global optimal transport improves averages while still dropping sparse high-stakes details that flat models already capture.
Authors: We agree that aggregate metrics alone leave open the possibility of overlooking omissions in sparse but clinically critical elements. In the revised manuscript we will add an entity-level error analysis section that reports per-finding omission rates for lesions, negations, and rare descriptors on both IU-Xray and MIMIC-CXR, directly comparing RIHA against the strongest flat-sequence baselines to quantify whether the hierarchical optimal-transport alignment reduces such omissions. revision: yes
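The promised entity-level analysis could be scripted along these lines. The finding names, the label extractor (a CheXpert/CheXbert-style labeler is assumed to exist), and all values below are hypothetical:

```python
from collections import Counter

def omission_rates(reference_labels, generated_labels, findings):
    """Per-finding omission rate: among reports whose reference marks a
    finding positive, the fraction where the generated report misses it.

    reference_labels / generated_labels: lists of dicts mapping finding -> bool
    (e.g. the output of a CheXpert-style labeler, assumed available).
    """
    present, omitted = Counter(), Counter()
    for ref, gen in zip(reference_labels, generated_labels):
        for f in findings:
            if ref.get(f, False):
                present[f] += 1
                if not gen.get(f, False):
                    omitted[f] += 1
    return {f: omitted[f] / present[f] for f in findings if present[f]}

# Toy example with two findings over three report pairs.
refs = [{"effusion": True, "nodule": True},
        {"effusion": True, "nodule": False},
        {"effusion": False, "nodule": True}]
gens = [{"effusion": True, "nodule": False},
        {"effusion": False, "nodule": False},
        {"effusion": False, "nodule": True}]
rates = omission_rates(refs, gens, ["effusion", "nodule"])
print(rates)  # each finding present in 2 references, omitted once -> 0.5 each
```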
Referee: [Experiments] The experimental section must report baseline re-implementations, statistical significance tests, error bars across runs, and explicit checks for data leakage between train/test splits on both IU-Xray and MIMIC-CXR; without these, the reported outperformance cannot be taken as load-bearing evidence for the hierarchical design.
Authors: We acknowledge that these experimental details are necessary for the results to serve as load-bearing evidence. The revised experimental section will include: (i) re-implementations of the primary baselines using the same training protocols, (ii) statistical significance tests (paired t-tests with p-values) on the NLG and clinical metrics, (iii) error bars computed over multiple random seeds, and (iv) explicit data-leakage verification confirming that no patient or study overlap exists between train and test splits on either dataset. revision: yes
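Items (ii) and (iii) of this protocol can be sketched with standard-library code; all scores below are made-up illustrative values, not results from the paper:

```python
import math
from statistics import mean, stdev

def paired_t_test(scores_a, scores_b):
    """Paired t-statistic for per-example metric scores of two systems.

    scores_a, scores_b: equal-length lists of per-report metric values
    (e.g. BLEU-4 per test case) for the proposed model and a baseline.
    Returns (t, df); compare |t| against the t-distribution with df degrees
    of freedom (or bootstrap) to obtain a p-value.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

def seed_error_bars(metric_by_seed):
    """Mean and sample standard deviation of a metric across random seeds."""
    return mean(metric_by_seed), stdev(metric_by_seed)

# Hypothetical per-case BLEU-4 scores for the model vs. a flat-sequence baseline.
model = [0.21, 0.18, 0.25, 0.19, 0.22, 0.24]
base  = [0.19, 0.17, 0.22, 0.20, 0.20, 0.21]
t, df = paired_t_test(model, base)
m, s = seed_error_bars([0.192, 0.188, 0.195])
print(df, round(t, 2), round(m, 3))
```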
Circularity Check
No circularity: RIHA's multi-level alignment is an independent architectural proposal evaluated empirically
full rationale
The paper introduces RIHA as a novel end-to-end framework combining Visual Feature Pyramid (VFP), Text Feature Pyramid (TFP), Cross-modal Hierarchical Alignment (CHA) via optimal transport, and Relative Positional Encoding (RPE). Performance claims rest on standard benchmark evaluations (IU-Xray, MIMIC-CXR) for NLG and clinical metrics, with no equations or derivations that reduce predictions to fitted inputs, self-definitions, or self-citation chains. The hierarchical alignment is motivated externally by limitations of flat-sequence models rather than being tautological with its own components. No load-bearing step equates any result to its construction inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- Feature pyramid scales and dimensions
- Optimal transport regularization parameters
axioms (2)
- domain assumption: Optimal transport provides an effective mechanism for cross-modal alignment at multiple granularities
- domain assumption: Relative positional encoding improves token-level alignment between visual features and generated text