pith. machine review for the scientific record. sign in

arxiv: 2604.20306 · v1 · submitted 2026-04-22 · 💻 cs.CV · cs.AI

Recognition: unknown

Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA

Authors on Pith no claims yet

Pith reviewed 2026-05-10 00:02 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords medical visual question answeringcausal inferencebackdoor adjustmentinstrumental variabledeconfounded representationsstructural causal modelout-of-distribution generalizationmultimodal learning
0
0 comments X

The pith

A dual causal method combines backdoor adjustment with instrumental variable learning to remove both visible and hidden biases from medical visual question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical visual question answering systems frequently latch onto superficial image-question patterns instead of genuine diagnostic links. The paper introduces Dual Causal Inference, a unified framework that builds a structural causal model to separate observable confounders via backdoor adjustment from unobserved ones via an instrumental variable. Mutual information constraints are imposed to ensure the instrumental variable stays independent of hidden biases while remaining tied to the fused representations. This produces deconfounded features that reflect true causal relationships rather than data shortcuts. Tests on SLAKE, SLAKE-CP, VQA-RAD, and PathVQA show consistent gains, especially under out-of-distribution shifts.

Core claim

DCI is the first unified architecture that integrates Backdoor Adjustment and Instrumental Variable learning to jointly tackle both observable and unobserved confounders in MedVQA, extracting deconfounded representations that capture genuine causal relationships through a structural causal model and mutual information constraints.

What carries the argument

The structural causal model that applies backdoor adjustment to observable cross-modal biases while learning a valid instrumental variable from a shared latent space under mutual information constraints.

Load-bearing premise

The mutual information constraints are sufficient to produce an instrumental variable that is independent of unobserved confounders while still informative for the multimodal representations.

What would settle it

Evaluation on a dataset where new unobserved confounders are deliberately injected and the mutual information constraints fail to isolate them, resulting in performance no better than non-causal baselines.

Figures

Figures reproduced from arXiv: 2604.20306 by Jin Wang, Ke Lu, Qiang Li, Weizhi Nie, Yuting Su, Zibo Xu.

Figure 1
Figure 1. Figure 1: Motivation for applying causal inference in MedVQA. (a) Conven￾tional MedVQA models estimate the observational probability P(A|V, Q). They easily fall prey to spurious correlations driven by dataset biases, leading to incorrect predictions. (b) Our proposed framework intervenes via P(A|do(V, Q)) utilizing a dual causal approach (BDA and IV). This structurally mitigates confounding effects, enabling the mod… view at source ↗
Figure 2
Figure 2. Figure 2: Structural Causal Model (SCM) for multi-modal MedVQA. Gray [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the proposed Dual Causal Inference (DCI) framework. It employs a sequential deconfounding paradigm to mitigate both observable and unobserved biases. The Backdoor Adjustment module (top) leverages confounder dictionaries to eliminate explicit dataset biases, yielding the observable￾deconfounded representation X. Subsequently, the Instrumental Variable Learning module (bottom) distills a valid l… view at source ↗
Figure 4
Figure 4. Figure 4: Hyperparameter sensitivity of the dictionary size [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative results on SLAKE under the challenging “one-image, multi-question” scenario. Green bounding boxes denote the true reasons, red bounding [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative results across various medical modalities and organs. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
read the original abstract

Medical Visual Question Answering (MedVQA) aims to generate clinically reliable answers conditioned on complex medical images and questions. However, existing methods often overfit to superficial cross-modal correlations, neglecting the intrinsic biases embedded in multimodal medical data. Consequently, models become vulnerable to cross-modal confounding effects, severely hindering their ability to provide trustworthy diagnostic reasoning. To address this limitation, we propose a novel Dual Causal Inference (DCI) framework for MedVQA. To the best of our knowledge, DCI is the first unified architecture that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to jointly tackle both observable and unobserved confounders. Specifically, we formulate a Structural Causal Model (SCM) where observable cross-modal biases (e.g., frequent visual and textual co-occurrences) are mitigated via BDA, while unobserved confounders are compensated using an IV learned from a shared latent space. To guarantee the validity of the IV, we design mutual information constraints that maximize its dependence on the fused multimodal representations while minimizing its associations with the unobserved confounders and target answers. Through this dual mechanism, DCI extracts deconfounded representations that capture genuine causal relationships. Extensive experiments on four benchmark datasets, SLAKE, SLAKE-CP, VQA-RAD, and PathVQA, demonstrate that our method consistently outperforms existing approaches, particularly in out-of-distribution (OOD) generalization. Furthermore, qualitative analyses confirm that DCI significantly enhances the interpretability and robustness of cross-modal reasoning by explicitly disentangling true causal effects from spurious cross-modal shortcuts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Dual Causal Inference (DCI), a unified framework for Medical Visual Question Answering (MedVQA) that integrates Backdoor Adjustment (BDA) to mitigate observable cross-modal biases and Instrumental Variable (IV) learning to handle unobserved confounders within a Structural Causal Model (SCM). An IV is learned from a shared latent space and regularized via mutual information constraints that aim to maximize dependence on fused multimodal representations while minimizing associations with unobserved confounders and target answers. Experiments on SLAKE, SLAKE-CP, VQA-RAD, and PathVQA report consistent gains over baselines, with emphasis on improved out-of-distribution generalization and interpretability.

Significance. If the deconfounding claims hold, the work would offer a practical route to more reliable MedVQA systems by explicitly separating causal effects from spurious correlations in multimodal medical data, which could improve robustness in clinical settings where unobserved factors such as imaging artifacts or prior knowledge are common.

major comments (2)
  1. [Abstract and Methods (IV construction)] Abstract and Methods (IV construction): The mutual information constraints are asserted to guarantee a valid IV satisfying relevance, independence from unobserved confounders, and the exclusion restriction. However, because MI estimation in neural networks is variational and approximate, the manuscript must supply either an identification proof under the SCM or empirical bounds showing that the learned latent satisfies the three IV conditions; without these, the deconfounding guarantee does not necessarily follow from the stated penalties.
  2. [Experiments section] Experiments section: While outperformance on four benchmarks including OOD splits is reported, the absence of ablations that isolate the incremental contribution of the BDA module versus the IV module, or of statistical tests comparing against strong multimodal baselines, leaves open the possibility that gains arise from other architectural choices rather than the dual causal mechanism.
minor comments (2)
  1. [Abstract] The abstract states that DCI is 'the first unified architecture' integrating BDA and IV; a brief citation to the closest prior causal MedVQA works would strengthen this positioning.
  2. [Methods] Notation for the shared latent space and the precise form of the MI terms (e.g., which estimators are used) should be introduced earlier and used consistently in the methods description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment point-by-point below, indicating the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses
  1. Referee: Abstract and Methods (IV construction): The mutual information constraints are asserted to guarantee a valid IV satisfying relevance, independence from unobserved confounders, and the exclusion restriction. However, because MI estimation in neural networks is variational and approximate, the manuscript must supply either an identification proof under the SCM or empirical bounds showing that the learned latent satisfies the three IV conditions; without these, the deconfounding guarantee does not necessarily follow from the stated penalties.

    Authors: We agree that the variational approximation of mutual information means the constraints provide only an empirical proxy rather than a strict guarantee. In the revised manuscript we will add a new subsection (Methods 3.4) containing (i) a proof sketch under the SCM showing how the three MI penalties encourage relevance, independence from U, and exclusion, and (ii) empirical bounds obtained by reporting the estimated MI values together with sensitivity analyses on the learned latent across all four datasets. These additions will make the deconfounding claim more rigorous. revision: yes

  2. Referee: Experiments section: While outperformance on four benchmarks including OOD splits is reported, the absence of ablations that isolate the incremental contribution of the BDA module versus the IV module, or of statistical tests comparing against strong multimodal baselines, leaves open the possibility that gains arise from other architectural choices rather than the dual causal mechanism.

    Authors: We concur that component-wise ablations and statistical significance tests are necessary to attribute gains specifically to the dual causal design. In the revised Experiments section we will add (i) three controlled variants—DCI without BDA, DCI without IV, and full DCI—evaluated on SLAKE, SLAKE-CP, VQA-RAD, and PathVQA (including OOD splits), and (ii) paired statistical tests (McNemar’s test and Wilcoxon signed-rank test with p-values) against the strongest multimodal baselines. The new tables will quantify the incremental benefit of each module. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain

full rationale

The paper formulates an SCM for MedVQA, applies BDA for observable confounders, and introduces an IV from a shared latent space whose validity is enforced by mutual information constraints that maximize dependence on fused representations while minimizing links to confounders and answers. This design choice produces deconfounded representations as the optimization outcome, but the provided text contains no equations demonstrating that the final causal claims or predictions are identical to the fitted inputs by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing steps. The central architecture is presented as a novel integration with external benchmark validation, rendering the derivation self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on an assumed Structural Causal Model for medical VQA data and on the effectiveness of mutual-information constraints for IV validity; no numerical free parameters are stated in the abstract.

axioms (1)
  • domain assumption A Structural Causal Model accurately captures the observable and unobserved confounding structure in multimodal medical data.
    Invoked to justify the use of BDA for visible biases and IV for hidden biases.
invented entities (1)
  • Instrumental variable learned from shared latent space no independent evidence
    purpose: To compensate for unobserved confounders while satisfying mutual information constraints.
    Introduced as the mechanism for handling hidden confounders; no external falsifiable evidence (e.g., predicted observable) is supplied in the abstract.

pith-pipeline@v0.9.0 · 5588 in / 1321 out tokens · 46183 ms · 2026-05-10T00:02:24.625612+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

    cs.CV 2026-05 unverdicted novelty 5.0

    LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.

Reference graph

Works this paper leans on

45 extracted references · 3 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Multiscale feature extraction and fusion of image and text in vqa,

    S. Lu, Y . Ding, M. Liu, Z. Yin, L. Yin, and W. Zheng, “Multiscale feature extraction and fusion of image and text in vqa,”International Journal of Computational Intelligence Systems, vol. 16, no. 1, p. 54, 2023

  2. [2]

    From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models ,

    J. Guo, J. Li, D. Li, A. M. Huat Tiong, B. Li, D. Tao, and S. Hoi, “ From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models ,” inCVPR, 2023, pp. 10 867–10 877

  3. [3]

    Cross modality bias in visual question answering: A causal view with possible worlds vqa,

    A. V osoughi, S. Deng, S. Zhang, Y . Tian, C. Xu, and J. Luo, “Cross modality bias in visual question answering: A causal view with possible worlds vqa,”IEEE TMM, vol. 26, pp. 8609–8624, 2024

  4. [4]

    Pmc-llama: toward building open-source language models for medicine,

    C. Wu, W. Lin, X. Zhang, Y . Zhang, W. Xie, and Y . Wang, “Pmc-llama: toward building open-source language models for medicine,”Journal of the American Medical Informatics Association, vol. 31, no. 9, pp. 1833–1843, 2024

  5. [5]

    Pmc-vqa: Visual instruction tuning for medical visual question answering.arXiv preprint arXiv:2305.10415, 2023

    X. Zhang, C. Wu, Z. Zhao, W. Lin, Y . Zhang, Y . Wang, and W. Xie, “Pmc-vqa: Visual instruction tuning for medical visual question answer- ing,”arXiv preprint arXiv:2305.10415, 2023

  6. [6]

    Pubmedclip: How much does clip benefit visual question answering in the medical domain?

    S. Eslami, C. Meinel, and G. De Melo, “Pubmedclip: How much does clip benefit visual question answering in the medical domain?” in Findings of the Association for Computational Linguistics: EACL 2023, 2023, pp. 1151–1163

  7. [7]

    Masked vision and language pre-training with unimodal and multimodal contrastive losses for medical visual question answering,

    P. Li, G. Liu, J. He, Z. Zhao, and S. Zhong, “Masked vision and language pre-training with unimodal and multimodal contrastive losses for medical visual question answering,” inMICCAI. Springer, 2023, pp. 374–383

  8. [8]

    Multi-modal masked autoencoders for medical vision-and-language pre-training,

    Z. Chen, Y . Du, J. Hu, Y . Liu, G. Li, X. Wan, and T.-H. Chang, “Multi-modal masked autoencoders for medical vision-and-language pre-training,” inMICCAI. Springer, 2022, pp. 679–689

  9. [9]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day,

    C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, “Llava-med: Training a large language-and-vision assistant for biomedicine in one day,”NeurIPS, vol. 36, pp. 28 541– 28 564, 2023

  10. [10]

    Chest x-ray image classification: a causal perspective,

    W. Nie, C. Zhang, D. Song, Y . Bai, K. Xie, and A.-A. Liu, “Chest x-ray image classification: a causal perspective,” inMICCAI. Springer, 2023, pp. 25–35

  11. [11]

    Cross-modal causal relational reasoning for event-level visual question answering,

    Y . Liu, G. Li, and L. Lin, “Cross-modal causal relational reasoning for event-level visual question answering,”IEEE TPAMI, vol. 45, pp. 11 624–11 641, 2022

  12. [12]

    Mitigating inherent bias of answer heuristic based frameworks in knowledge-based visual question answering,

    J. Lu, M. Jiang, J. Kong, D. Zhuang, and M. Lu, “Mitigating inherent bias of answer heuristic based frameworks in knowledge-based visual question answering,”IEEE TMM, vol. 28, pp. 1744–1755, 2026

  13. [13]

    Enhanced reasoning via multimodal llms and collaborative inference,

    Z. Wen, M. Tan, Y . Wang, Q. Wu, and Q. Wu, “Enhanced reasoning via multimodal llms and collaborative inference,”IEEE TMM, vol. 27, pp. 7166–7178, 2025

  14. [14]

    Causal inference in statistics: An overview,

    J. Pearl, “Causal inference in statistics: An overview,”Statistics Surveys, vol. 3, pp. 96–146, 2009

  15. [15]

    Causal attention for vision- language tasks,

    X. Yang, H. Zhang, G. Qi, and J. Cai, “Causal attention for vision- language tasks,” inCVPR, 2021, pp. 9847–9857

  16. [16]

    Open-ended medical visual question answering through prefix tuning of language models,

    T. Van Sonsbeek, M. M. Derakhshani, I. Najdenkoska, C. G. Snoek, and M. Worring, “Open-ended medical visual question answering through prefix tuning of language models,” inMICCAI. Springer, 2023, pp. 726–736

  17. [17]

    Pmc-clip: Contrastive language-image pre-training using biomedical documents,

    W. Lin, Z. Zhao, X. Zhang, C. Wu, Y . Zhang, Y . Wang, and W. Xie, “Pmc-clip: Contrastive language-image pre-training using biomedical documents,” inMICCAI. Springer, 2023, pp. 525–536

  18. [18]

    Overcoming data limitation in medical visual question answering,

    B. D. Nguyen, T.-T. Do, B. X. Nguyen, T. Do, E. Tjiputra, and Q. D. Tran, “Overcoming data limitation in medical visual question answering,” inMICCAI. Springer, 2019, pp. 522–530

  19. [19]

    Contrastive pre-training and representation distillation for medical visual question answering based on radiology images,

    B. Liu, L.-M. Zhan, and X.-M. Wu, “Contrastive pre-training and representation distillation for medical visual question answering based on radiology images,” inMICCAI. Springer, 2021, pp. 210–220

  20. [20]

    Self-supervised vision- language pretraining for medial visual question answering,

    P. Li, G. Liu, L. Tan, J. Liao, and S. Zhong, “Self-supervised vision- language pretraining for medial visual question answering,” in2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI). IEEE, 2023, pp. 1–5

  21. [21]

    Amam: an attention-based multimodal alignment model for medical visual question answering,

    H. Pan, S. He, K. Zhang, B. Qu, C. Chen, and K. Shi, “Amam: an attention-based multimodal alignment model for medical visual question answering,”Knowledge-Based Systems, vol. 255, p. 109763, 2022

  22. [22]

    A dual-attention learning network with word and sentence embedding for medical visual question answering,

    X. Huang and H. Gong, “A dual-attention learning network with word and sentence embedding for medical visual question answering,”IEEE Transactions on Medical Imaging, vol. 43, no. 2, pp. 832–845, 2023

  23. [23]

    Localized questions in medical visual question answering,

    S. Tascon-Morales, P. M ´arquez-Neila, and R. Sznitman, “Localized questions in medical visual question answering,” inMICCAI. Springer, 2023, pp. 361–370

  24. [24]

    Motor: Multimodal optimal transport via grounded retrieval in medical visual question answering,

    M. A. Shaaban, T. J. Saleem, V . R. K. Papineni, and M. Yaqub, “Motor: Multimodal optimal transport via grounded retrieval in medical visual question answering,” inMICCAI. Springer, 2025, pp. 459–469

  25. [25]

    Causal interven- tion for weakly-supervised semantic segmentation,

    D. Zhang, H. Zhang, J. Tang, X.-S. Hua, and Q. Sun, “Causal interven- tion for weakly-supervised semantic segmentation,” inNeurIPS, vol. 33. Curran Associates, Inc., 2020, pp. 655–666

  26. [26]

    Multi-label chest x-ray image classification via category disentangled causal learning,

    Q. Li, M. Liu, R. Chang, W. Nie, S. Bai, and A. Liu, “Multi-label chest x-ray image classification via category disentangled causal learning,” IEEE Transactions on Artificial Intelligence, 2025

  27. [27]

    Cross- modal causal representation learning for radiology report generation,

    W. Chen, Y . Liu, C. Wang, J. Zhu, G. Li, C.-L. Liu, and L. Lin, “Cross- modal causal representation learning for radiology report generation,” IEEE TIP, vol. 34, pp. 2970–2985, 2025

  28. [28]

    Debiasing medical visual question answering via counterfactual training,

    C. Zhan, P. Peng, H. Zhang, H. Sun, C. Shang, T. Chen, H. Wang, G. Wang, and H. Wang, “Debiasing medical visual question answering via counterfactual training,” inMICCAI. Springer, 2023, pp. 382–393

  29. [29]

    Eliminating language bias for medical visual question answering with counterfactual contrastive training,

    X. Wan, Q. Teng, J. Chen, Y . Lu, D. Yuan, and Z. Liu, “Eliminating language bias for medical visual question answering with counterfactual contrastive training,” inMICCAI. Springer, 2025, pp. 194–204

  30. [30]

    Counterfactual causal-effect intervention for interpretable medical visual question answering,

    L. Cai, H. Fang, N. Xu, and B. Ren, “Counterfactual causal-effect intervention for interpretable medical visual question answering,”IEEE Transactions on Medical Imaging, vol. 43, no. 12, pp. 4430–4441, 2024

  31. [31]

    Cimb- mvqa: Causal intervention on modality-specific biases for medical visual question answering,

    B. Liu, L. Liu, J. Ding, X. Yang, W. Peng, and L. Liu, “Cimb- mvqa: Causal intervention on modality-specific biases for medical visual question answering,”Medical Image Analysis, p. 103850, 2025

  32. [32]

    Instrumental variable learning for chest x-ray classification,

    W. Nie, C. Zhang, D. Song, Y . Bai, K. Xie, and A. Liu, “Instrumental variable learning for chest x-ray classification,” in2023 IEEE Interna- tional Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2023, pp. 4506–4512

  33. [33]

    Instrumental variable-driven domain generalization with unobserved confounders,

    J. Yuan, X. Ma, R. Xiong, M. Gong, X. Liu, F. Wu, L. Lin, and K. Kuang, “Instrumental variable-driven domain generalization with unobserved confounders,”ACM Transactions on Knowledge Discovery from Data, vol. 17, no. 8, pp. 1–21, 2023

  34. [34]

    Instrumental variables in causal inference and machine learning: A survey,

    A. Wu, K. Kuang, R. Xiong, and F. Wu, “Instrumental variables in causal inference and machine learning: A survey,”ACM Computing Surveys, vol. 57, no. 11, pp. 1–36, 2025

  35. [35]

    Unbiased visual question answering by leveraging instrumental variable,

    Y . Pan, J. Liu, L. Jin, and Z. Li, “Unbiased visual question answering by leveraging instrumental variable,”IEEE TMM, vol. 26, pp. 6648–6662, 2024

  36. [36]

    Long-tailed classification by keeping the good and removing the bad momentum causal effect,

    K. Tang, J. Huang, and H. Zhang, “Long-tailed classification by keeping the good and removing the bad momentum causal effect,” inNeurIPS, vol. 33. Curran Associates, Inc., 2020, pp. 1513–1524

  37. [37]

    An introduction to causal inference,

    J. Pearl, “An introduction to causal inference,”The international journal of biostatistics, vol. 6, no. 2, p. 7, 2010

  38. [38]

    Show, attend and tell: Neural image caption generation with visual attention,

    K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y . Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” inICML. PMLR, 2015, pp. 2048–2057

  39. [39]

    Club: A contrastive log-ratio upper bound of mutual information,

    P. Cheng, W. Hao, S. Dai, J. Liu, Z. Gan, and L. Carin, “Club: A contrastive log-ratio upper bound of mutual information,” inICML. PMLR, 2020, pp. 1779–1788

  40. [40]

    A dataset of clinically generated visual questions and answers about radiology images,

    J. J. Lau, S. Gayen, A. Ben Abacha, and D. Demner-Fushman, “A dataset of clinically generated visual questions and answers about radiology images,”Scientific data, vol. 5, no. 1, pp. 1–10, 2018

  41. [41]

    Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering,

    B. Liu, L.-M. Zhan, L. Xu, L. Ma, Y . Yang, and X.-M. Wu, “Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering,” in2021 IEEE 18th international symposium on biomedical imaging (ISBI). IEEE, 2021, pp. 1650–1654

  42. [42]

    PathVQA: 30000+ Questions for Medical Visual Question Answering

    X. He, Y . Zhang, L. Mou, E. Xing, and P. Xie, “Pathvqa: 30000+ questions for medical visual question answering,”arXiv preprint arXiv:2003.10286, 2020

  43. [43]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

  44. [44]

    Miss: A generative pre-training and fine-tuning approach for med-vqa,

    J. Chen, D. Yang, Y . Jiang, Y . Lei, and L. Zhang, “Miss: A generative pre-training and fine-tuning approach for med-vqa,” inInternational Conference on Artificial Neural Networks. Springer, 2024, pp. 299–313

  45. [45]

    Grad-cam: Visual explanations from deep networks via gradient-based localization,

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” inICCV, 2017, pp. 618–626