pith. machine review for the scientific record.

arxiv: 2602.03151 · v2 · submitted 2026-02-03 · 💻 cs.AI

Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration

Pith reviewed 2026-05-16 08:31 UTC · model grok-4.3

classification 💻 cs.AI
keywords vision-language models, missing modality, diffusion models, feature restoration, model robustness, multimodal learning, zero-shot evaluation

The pith

An enhanced diffusion model restores missing modality features in vision-language models as a pluggable module without retraining the backbone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to address the sharp drop in vision-language model performance when one input type such as images or text becomes unavailable. It introduces an enhanced diffusion model trained as an intermediate stage that reconstructs the absent features using guidance from the available modality. Two mechanisms drive the process: dynamic gating that selects useful conditional signals and mutual learning that aligns representations in both directions. Because the original model stays frozen, the approach preserves its broad generalization while improving results on standard benchmarks at different missing rates. If the restoration succeeds, models can handle incomplete real-world inputs more reliably without custom retraining for each scenario.

Core claim

The authors claim that an enhanced diffusion model can act as a mid-stage pluggable component to recover missing modality features. Dynamic Modality Gating adaptively conditions generation on available inputs to keep restored features semantically consistent. Cross-Modal Mutual Learning aligns the semantic spaces of the two modalities bidirectionally. This combination allows precise feature restoration while leaving the pre-trained vision-language model untouched, and zero-shot tests show consistent gains over prompt-based and imputation baselines across varying missing rates and datasets.
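
As a rough illustration of what adaptive conditioning could look like, the sketch below gates a conditional feature before adding it to the feature being restored. The sigmoid gate, the parameter shapes, and the names `gated_condition`, `W_g`, and `b_g` are assumptions made for illustration; this summary does not give the paper's actual Dynamic Modality Gating formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_condition(h, c, W_g, b_g):
    """Blend a conditional feature c into a hidden feature h through a gate.

    h, c: arrays of shape (batch, dim)
    W_g:  (2 * dim, dim) and b_g: (dim,) are hypothetical gate parameters.
    """
    # The gate sees both inputs and outputs a per-dimension weight in (0, 1).
    gate = sigmoid(np.concatenate([h, c], axis=-1) @ W_g + b_g)
    # A gate near 0 ignores the condition; a gate near 1 passes it through.
    return h + gate * c, gate

rng = np.random.default_rng(0)
dim = 8
h = rng.normal(size=(4, dim))
c = rng.normal(size=(4, dim))
W_g = rng.normal(scale=0.1, size=(2 * dim, dim))
b_g = np.zeros(dim)
out, gate = gated_condition(h, c, W_g, b_g)
```

In a trained module the gate parameters would be learned; here they are random, so the example only shows the data flow.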

What carries the argument

The enhanced diffusion model with Dynamic Modality Gating for adaptive conditional guidance and Cross-Modal Mutual Learning for bidirectional semantic alignment, inserted as a separate training module between modalities.
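
One common way to align two feature spaces in both directions is a symmetric contrastive loss. The sketch below uses that as a stand-in, since the paper's own mutual-learning loss is not stated here; the function name `symmetric_alignment_loss` and the `temperature` value are assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_alignment_loss(img, txt, temperature=0.07):
    """Average of image-to-text and text-to-image contrastive losses.

    img, txt: (batch, dim) features; row i of each is a matched pair.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # rows: image queries
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # columns: text queries
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 16))
aligned = symmetric_alignment_loss(feats, feats)
shuffled = symmetric_alignment_loss(feats, np.roll(feats, 1, axis=0))
```

The loss is low when matched pairs are close and mismatched pairs are far apart in both directions, which is the property "bidirectional alignment" asks for.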

If this is right

  • Vision-language models equipped with the module sustain performance across low to high missing rates without retraining the core network.
  • Bi-directional alignment prevents the generation of semantically irrelevant noise that would otherwise degrade generalization.
  • The method functions as a general add-on applicable to multiple existing vision-language architectures.
  • Zero-shot evaluations confirm the approach scales to diverse datasets and missing conditions while keeping the original model integrity intact.
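
Evaluating at a given missing rate can be simulated by swapping restored features in for a random fraction of samples. The sketch below is a minimal version of that setup; `apply_missing` and its arguments are hypothetical names, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def apply_missing(features, restored, missing_rate, rng):
    """Replace a random fraction of rows with their restored counterparts.

    features, restored: (n, dim) arrays for one modality.
    missing_rate: fraction of samples treated as missing, between 0 and 1.
    Returns the mixed features and the boolean mask of missing rows.
    """
    n = features.shape[0]
    n_missing = int(round(missing_rate * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_missing, replace=False)] = True
    mixed = np.where(mask[:, None], restored, features)
    return mixed, mask
```

The mixed features would then be passed to the frozen vision-language model, once per missing rate of interest.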

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gated diffusion restoration could be tested on other multimodal systems that combine vision with audio or sensor streams.
  • In deployment settings with intermittent sensor failure, the module might reduce the need for separate error-handling logic inside each model.
  • One could measure whether the learned alignment transfers to new modality pairs not seen during the diffusion training stage.

Load-bearing premise

The diffusion model produces features that remain semantically relevant to the available modality and improve rather than harm the downstream vision-language model tasks, even without any adjustment to the original backbone parameters.
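
A simple proxy for semantic relevance is the cosine similarity between each restored feature and the feature the backbone would have produced had the modality been present. The sketch below assumes both are available as arrays; `cosine_similarity` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (n, dim) arrays."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den
```

High similarity alone does not guarantee better downstream accuracy, so this check complements a task-level evaluation rather than replacing it.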

What would settle it

A controlled test in which the restored features are fed to the vision-language model and accuracy on a held-out benchmark falls below the accuracy obtained by simply dropping the missing modality or using basic mean imputation.
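
The comparison described above can be sketched as a small check. The names `mean_impute` and `restoration_helps` and the `margin` parameter are hypothetical; the accuracies would come from running the frozen vision-language model on each variant of the held-out benchmark, which is not shown.

```python
import numpy as np

def mean_impute(features, missing_mask):
    """Basic baseline: fill missing rows with the mean of the available rows."""
    filled = features.copy()
    filled[missing_mask] = features[~missing_mask].mean(axis=0)
    return filled

def restoration_helps(acc_restored, acc_dropped, acc_mean_imputed, margin=0.0):
    """Return True only if restored features beat both trivial baselines."""
    return acc_restored > max(acc_dropped, acc_mean_imputed) + margin
```

If `restoration_helps` returns False on a held-out benchmark, the load-bearing premise fails for that setting.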

Figures

Figures reproduced from arXiv: 2602.03151 by Fan Li, Haixia Bi, Haoyu Wang, Honghao Chang, Jian Sun, Lijun He, Wei Dai.

Figure 1: An overview of Multimodal Missing Modality: …
Figure 2: PCA results across 1,000 samples show that our …
Figure 3: The architecture of our Missing Modality Restoration Framework. (a) Feature extraction process for available …
Figure 4: The visualization of Dynamic Modality Gating: The …
Figure 5: Robustness analysis on the MM-IMDb dataset across various missing rates in terms of F1-M Score.
… to 3M (CC3M [44]). For instance, the accuracy on MMHS11K under the image-missing scenario improves from 85.59% to 86.47%. These results validate that the framework scales effectively with increased data volume, leading to enhanced performance. 4.5.2 Number of DiT Layers. To investigate the impact of model scale,…
Figure 6: T-SNE analysis of restored features on the …
Figure 7: Category-wise cosine similarity for 1,000 features …
Figure 8: Comparison of inference efficiency and computa…
Original abstract

Vision Language Models (VLMs) typically assume complete modality input during inference. However, their effectiveness drops sharply when certain modalities are unavailable or incomplete. Current research on missing modality primarily faces two dilemmas: Prompt-based methods struggle to restore missing yet indispensable features and degrade the generalizability of VLMs. Imputation-based approaches, lacking effective guidance, are prone to generating semantically irrelevant noise. Restoring precise semantics while sustaining VLMs' generalization remains challenging. Therefore, we propose a general missing modality restoration strategy in this paper. We introduce an enhanced diffusion model as a pluggable mid-stage training module to effectively restore missing features. Our strategy introduces two key innovations: (I) Dynamic Modality Gating, which adaptively leverages conditional features to guide the generation of semantically consistent features; (II) Cross-Modal Mutual Learning mechanism, which bridges the semantic spaces of the dual models to achieve bi-directional alignment. Notably, our strategy maintains the original integrity of the pre-trained VLM, requiring no fine-tuning of the backbone models while significantly boosting resilience to information loss. Zero-shot evaluations across benchmark datasets demonstrate that our approach consistently outperforms existing baselines, establishing it as a robust and scalable extension that ensures VLM reliability across diverse missing rates and conditions. Our code and models will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes an enhanced diffusion model as a pluggable mid-stage training module to restore missing modalities in pre-trained Vision-Language Models (VLMs). Key innovations include Dynamic Modality Gating to adaptively guide feature generation from conditional inputs and a Cross-Modal Mutual Learning mechanism for bi-directional semantic alignment between modalities. The approach requires no fine-tuning of the VLM backbone and is evaluated in zero-shot settings across benchmark datasets, claiming consistent outperformance over existing prompt-based and imputation baselines under varying missing rates.

Significance. If the empirical results hold, the work provides a scalable, non-invasive way to improve VLM reliability in incomplete-modality scenarios without retraining large backbones, which is valuable for deployment in real-world settings such as robotics or medical imaging. The pluggable design, explicit avoidance of backbone fine-tuning, and commitment to releasing code and models are strengths that support reproducibility and adoption.

major comments (1)
  1. [Abstract] The central empirical claim of consistent outperformance (Abstract) rests on zero-shot evaluations, yet the provided text supplies no quantitative metrics, specific baseline implementations, ablation studies on the gating or mutual-learning components, or error bars; this prevents verification of the magnitude and robustness of the reported gains.
minor comments (2)
  1. [Method] Clarify the precise architecture of the 'enhanced diffusion model' (e.g., how it differs from standard conditional diffusion) and provide a diagram of the Dynamic Modality Gating mechanism in the method section.
  2. [Method] The phrase 'bi-directional alignment' is used without an explicit definition or loss formulation; add the corresponding equation or pseudocode to avoid ambiguity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We appreciate the emphasis on empirical clarity and have addressed the concern by planning targeted revisions to strengthen the presentation of results.

Point-by-point responses
  1. Referee: [Abstract] The central empirical claim of consistent outperformance (Abstract) rests on zero-shot evaluations, yet the provided text supplies no quantitative metrics, specific baseline implementations, ablation studies on the gating or mutual-learning components, or error bars; this prevents verification of the magnitude and robustness of the reported gains.

    Authors: We agree that the abstract would benefit from explicit quantitative support to allow immediate assessment of the claimed gains. The full manuscript (Section 4) already contains detailed zero-shot results across benchmarks, including tables with accuracy metrics under varying missing rates, comparisons to prompt-based and imputation baselines with specific implementation details (e.g., CLIP-based prompts and diffusion imputation variants), ablation studies isolating Dynamic Modality Gating and Cross-Modal Mutual Learning, and error bars from 3-5 runs with standard deviations. To directly address the comment, we will revise the abstract to include key quantitative highlights, such as average accuracy improvements (e.g., +X% on Dataset Y at Z% missing rate). This change will be made without altering the underlying claims or experimental setup.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces an enhanced diffusion model as a pluggable mid-stage module with two explicit new components (Dynamic Modality Gating and Cross-Modal Mutual Learning) to restore missing features while preserving the pre-trained VLM backbone without fine-tuning. All performance claims rest on external zero-shot empirical evaluations across benchmark datasets rather than any internal equations, fitted parameters renamed as predictions, or self-citation chains. No load-bearing step reduces the claimed robustness gains to a self-definition or tautology; the method is additive and independently testable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard diffusion-model assumptions about conditional generation and on the premise that semantic alignment between modalities can be learned without backbone changes.

axioms (1)
  • domain assumption Diffusion models conditioned on partial features can produce semantically consistent completions for missing modalities
    Invoked when claiming the enhanced diffusion restores precise semantics without noise.
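
For readers unfamiliar with the assumption, the sketch below shows the standard forward noising step and noise-prediction loss of a denoising diffusion model, applied to feature vectors rather than images. The linear schedule values and the function names are assumptions for illustration; the conditional denoiser itself (which would take the available modality as input) is left out, and nothing here is taken from the paper's implementation.

```python
import numpy as np

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Noise clean features x0 to step t: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.normal(size=x0.shape)
    a = alpha_bar[t][:, None]  # one noise level per sample
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

def denoising_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return np.mean((eps_pred - eps) ** 2)
```

During training, a conditional denoiser would receive `x_t`, `t`, and the available-modality features, and its output would be scored with `denoising_loss` against `eps`.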

pith-pipeline@v0.9.0 · 5546 in / 1068 out tokens · 31737 ms · 2026-05-16T08:31:32.672045+00:00 · methodology


Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 3 internal anchors

  1. [1]

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al

  2. [2]

    Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35 (2022), 23716–23736

  3. [3]

    Tomer Amit, Tal Shaharbany, Eliya Nachmani, and Lior Wolf. 2021. Segdiff: Image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390 (2021)

  4. [4]

    Ruichuan An, Sihan Yang, Ziyu Guo, Wei Dai, Zijun Shen, Haodong Li, Renrui Zhang, Xinyu Wei, Guopeng Li, Wenshan Wu, et al. 2026. Genius: Generative fluid intelligence evaluation suite. arXiv preprint arXiv:2602.11144 (2026)

  5. [5]

    Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, et al. 2025. Unictokens: Boosting personalized understanding and generation via unified concept tokens. arXiv preprint arXiv:2505.14671 (2025)

  6. [6]

    John Arevalo, Thamar Solorio, Manuel Montes-y Gómez, and Fabio A González

  7. [7]

    Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992 (2017)

  8. [8]

    Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence 41, 2 (2018), 423–443

  9. [9]

    Jason Becker, Chris Wendler, Peter Baylies, Robert West, and Christian Wressnegger. 2025. Controlling Latent Diffusion Using Latent CLIP. arXiv preprint arXiv:2503.08455 (2025)

  10. [10]

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101 – Mining Discriminative Components with Random Forests. In European Conference on Computer Vision

  11. [11]

    Jiajun Chen, Sai Cheng, Yutao Yuan, Yirui Zhang, Haitao Yuan, Peng Peng, and Yi Zhong. 2025. PROMISE: Prompt-Attentive Hierarchical Contrastive Learning for Robust Cross-Modal Representation with Missing Modalities. arXiv preprint arXiv:2511.10997 (2025)

  12. [12]

    Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. 2023. Diffusiondet: Diffusion model for object detection. In Proceedings of the IEEE/CVF international conference on computer vision. 19830–19843

  13. [13]

    Ruiting Dai, Chenxi Li, Yandong Yan, Lisi Mo, Ke Qin, and Tao He. 2025. Unbiased missing-modality multimodal learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 24507–24517

  14. [14]

    Ruiting Dai, Yuqiao Tan, Lisi Mo, Tao He, Ke Qin, and Shuang Liang. 2025. RobustPT: Dynamic Disentanglement Prompt Tuning in Vision-Language Models with Missing Modalities. In Proceedings of the 2025 International Conference on Multimedia Retrieval. 164–172

  15. [15]

    Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34 (2021), 8780–8794

  16. [16]

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in neural information processing systems 27 (2014)

  17. [17]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851

  18. [18]

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. 2022. Video diffusion models. Advances in neural information processing systems 35 (2022), 8633–8646

  19. [19]

    Jaehyuk Jang, Yooseung Wang, and Changick Kim. 2024. Towards robust multimodal prompting with missing modalities. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8070–8074

  20. [20]

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. 2022. Visual prompt tuning. In European conference on computer vision. Springer, 709–727

  21. [21]

    Guanzhou Ke, Shengfeng He, Xiaoli Wang, Bo Wang, Guoqing Chao, Yuanyang Zhang, Yi Xie, and Hexing Su. 2025. Knowledge bridger: Towards training-free missing modality completion. In Proceedings of the Computer Vision and Pattern Recognition Conference. 25864–25873

  22. [22]

    Aghiles Kebaili, Jérôme Lapuyade-Lahorgue, Pierre Vera, and Su Ruan. 2025. Amm-diff: Adaptive multi-modality diffusion network for missing modality imputation. In 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI). IEEE, 1–4

  23. [23]

    Donggeun Kim and Taesup Kim. 2024. Missing modality prediction for unpaired multimodal learning via joint embedding of unimodal models. In European Conference on Computer Vision. Springer, 171–187

  24. [24]

    Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)

  25. [25]

    Jian Lang, Zhangtao Cheng, Ting Zhong, and Fan Zhou. 2025. Retrieval-augmented dynamic prompt tuning for incomplete multimodal learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 18035–18043

  26. [26]

    Jian Lang, Rongpei Hong, Zhangtao Cheng, Ting Zhong, Yong Wang, and Fan Zhou. 2025. REDEEMing Modality Information Loss: Retrieval-Guided Conditional Generation for Severely Modality Missing Learning. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2. 1241–1252

  27. [27]

    Jeong Ryong Lee, Yejee Shin, Geonhui Son, and Dosik Hwang. 2025. Diffusion bridge: leveraging diffusion model to reduce the modality gap between text and vision for zero-shot image captioning. In Proceedings of the Computer Vision and Pattern Recognition Conference. 4050–4059

  28. [28]

    Yi-Lun Lee, Yi-Hsuan Tsai, Wei-Chen Chiu, and Chen-Yu Lee. 2023. Multimodal prompting with missing modalities for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14943–14952

  29. [29]

    Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, and Deepak Pathak. 2023. Your diffusion model is secretly a zero-shot classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2206–2217

  30. [30]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730–19742

  31. [31]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning. PMLR, 12888–12900

  32. [32]

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. 2022. Grounded language-image pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10965–10975

  33. [33]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740–755

  34. [34]

    Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li. 2024. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want. arXiv preprint arXiv:2403.20271 (2024)

  35. [35]

    Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. 2025. Perceive anything: Recognize, explain, caption, and segment anything in images and videos. arXiv preprint arXiv:2506.05302 (2025)

  36. [36]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems 36 (2023), 34892–34916

  37. [37]

    Mengmeng Ma, Jian Ren, Long Zhao, Davide Testuggine, and Xi Peng. 2022. Are multimodal transformers robust to missing modality?. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 18177–18186

  38. [38]

    Mengmeng Ma, Jian Ren, Long Zhao, Sergey Tulyakov, Cathy Wu, and Xi Peng

  39. [39]

    Smil: Multimodal learning with severely missing modality. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 2302–2310

  40. [40]

    Xiangxi Meng, Kaicong Sun, Jun Xu, Xuming He, and Dinggang Shen. 2024. Multi-modal modality-masked diffusion network for brain mri synthesis with random modality missing. IEEE Transactions on Medical Imaging 43, 7 (2024), 2587–2598

  41. [41]

    William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision. 4195–4205

  42. [42]

    Vittorio Pipoli, Alessia Saporita, Federico Bolelli, Marcella Cornia, Lorenzo Baraldi, Costantino Grana, Rita Cucchiara, and Elisa Ficarra. 2025. Missrag: Addressing the missing modality challenge in multimodal large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3215–3224

  43. [43]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763

  44. [44]

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen

  45. [45]

    Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1, 2 (2022), 3

  46. [46]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695

  47. [47]

    Furqan Khan Saddozai, Sahar K Badri, Daniyal Alghazzawi, Asad Khattak, and Muhammad Zubair Asghar. 2025. Multimodal hate speech detection: a novel deep learning framework for multilingual text and images. PeerJ Computer Science 11 (2025), e2801

  48. [48]

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2556–2565

  49. [49]

    Tongkai Shi, Wei Feng, Fanhua Shang, Liang Wan, et al. 2024. Deep correlated prompting for visual recognition with missing modalities. Advances in Neural Information Processing Systems 37 (2024), 67446–67466

  50. [50]

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli

  51. [51]

    Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning. PMLR, 2256–2265

  52. [52]

    Hu Wang, Yuanhong Chen, Congbo Ma, Jodie Avery, Louise Hull, and Gustavo Carneiro. 2023. Multi-modal learning with missing modality via shared-specific feature modelling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15878–15887

  53. [53]

    Yuanzhi Wang, Yong Li, and Zhen Cui. 2023. Incomplete multimodality-diffused emotion recognition. Advances in Neural Information Processing Systems 36 (2023), 17117–17128

  54. [54]

    Zhen Wang, Xu Shan, Xiangxie Zhang, and Jie Yang. 2022. N24News: A New Dataset for Multimodal News Classification. In Proceedings of the Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 6768–6775

  55. [55]

    Huangbiao Xu, Huanqi Wu, Xiao Ke, Junyi Wu, Rui Xu, and Jinglin Xu. 2025. MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment. arXiv preprint arXiv:2511.17397 (2025)

  56. [56]

    Cuixin Yang, Rongkang Dong, and Kin-Man Lam. 2025. Vision-Language Model Guided Image Restoration. arXiv preprint arXiv:2512.17292 (2025)

  57. [57]

    Sihan Yang, Huitong Ji, Shaolin Lu, Jiayi Chen, Binxiao Xu, Ming Lu, Yuanxing Zhang, Wenhui Dong, and Wentao Zhang. 2025. Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM. arXiv preprint arXiv:2508.07260 (2025)

  58. [58]

    Bing Yu, Zhenghui Fan, Xue Xiang, Jiahui Chen, and Dongjin Huang. 2024. Universal Image Restoration with Text Prompt Diffusion. Sensors 24, 12 (2024), 3917

  59. [59]

    Wei Zhang, Juan Chen, Yanbo J Wang, En Zhu, Xuan Yang, and Yiduo Wang

  60. [60]

    ADMC: Attention-based Diffusion Model for Missing Modalities Feature Completion. arXiv preprint arXiv:2507.05624 (2025)

  61. [61]

    Zhihui Zhang, Luanyuan Dai, Qika Lin, Yunfeng Diao, Guangyin Jin, Yufei Guo, Jing Zhang, and Xiaoshuai Hao. 2025. Synergistic prompting for robust visual recognition with missing modalities. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1881–1890

  62. [62]

    Jinming Zhao, Ruichen Li, and Qin Jin. 2021. Missing modality imagination network for emotion recognition with uncertain missing modalities. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2608–2618

  63. [63]

    Shu Zhao, Nilesh Ahuja, Tan Yu, Tianyi Shen, and Vijaykrishnan Narayanan

  64. [64]

    MoRA: Missing Modality Low-Rank Adaptation for Visual Recognition. arXiv preprint arXiv:2511.06225 (2025)

  65. [65]

    Shu Zhao, Xiaohan Zou, Tan Yu, and Huijuan Xu. 2024. Reconstruct before query: Continual missing modality learning with decomposed prompt collaboration. arXiv preprint arXiv:2403.11373 (2024)

  66. [66]

    Yihan Zhao, Wei Xi, Xiao Fu, and Jizhong Zhao. 2025. Enhancing multimodal model robustness under missing modalities via memory-driven prompt learning. In Proceedings of the 34th International Joint Conference on Artificial Intelligence, IJCAI 2025. International Joint Conferences on Artificial Intelligence, 2458–2466

  67. [67]

    Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang, Huanyu Zhang, Haodong Li, Qintong Zhang, Renrui Zhang, Guopeng Li, et al. 2026. PEARL: Personalized Streaming Video Understanding Model. arXiv preprint arXiv:2603.20422 (2026)