arxiv: 2602.16144 · v3 · submitted 2026-02-18 · 💻 cs.CL · cs.LG

Missing-by-Design: Certifiable Modality Deletion for Revocable Multimodal Sentiment Analysis

Pith reviewed 2026-05-15 21:50 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords multimodal sentiment analysis · modality deletion · machine unlearning · privacy preservation · certifiable deletion · missing data reconstruction · representation learning

The pith

Missing-by-Design certifies deletion of specific modalities from multimodal sentiment models via targeted parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Missing-by-Design (MBD), a framework that lets multimodal sentiment systems respond to requests for removing particular input types such as audio or video. It first learns embeddings that separate modality properties and trains a generator to reconstruct missing channels while keeping signals useful for the sentiment task. When a deletion request arrives, the method uses saliency to pick influential parameters and applies a calibrated Gaussian update that produces a machine-verifiable certificate of removal. This process supports accurate predictions even when inputs are incomplete. A sympathetic reader would care because it gives a concrete way to honor privacy demands without retraining the entire model each time.

Core claim

MBD establishes that property-aware representation learning combined with generator-based reconstruction and a saliency-driven Gaussian parameter update can produce a machine-verifiable Modality Deletion Certificate confirming removal of modality-specific information, while delivering competitive predictive performance on incomplete inputs and a practical privacy-utility trade-off as an efficient alternative to full retraining.

What carries the argument

The argument rests on the Modality Deletion Certificate. It is generated by saliency-driven candidate selection followed by a calibrated Gaussian update on the model parameters, and it is what certifies removal of modality-specific information.
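
The two-step mechanism named above (pick influential parameters by saliency, then perturb only those with Gaussian noise) can be sketched as follows. This is a minimal illustration, not the paper's procedure: the helper names `select_candidates` and `gaussian_update`, the toy parameter vector, the saliency scores, and the noise scale `sigma` are all hypothetical, and the sketch does not produce a certificate.

```python
import random

def select_candidates(saliency, k):
    """Return the indices of the k parameters with the largest saliency scores."""
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:k])

def gaussian_update(params, indices, sigma, rng):
    """Add zero-mean Gaussian noise of scale sigma to the selected parameters only."""
    updated = list(params)
    for i in indices:
        updated[i] += rng.gauss(0.0, sigma)
    return updated

# Hypothetical toy values: six parameters and one saliency score per parameter
# (for example, a gradient magnitude with respect to an audio-only loss).
params = [0.5, -1.2, 0.3, 2.0, -0.7, 0.1]
saliency = [0.9, 0.1, 0.05, 0.8, 0.2, 0.02]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
chosen = select_candidates(saliency, k=2)
new_params = gaussian_update(params, chosen, sigma=0.5, rng=rng)
untouched = [i for i in range(len(params)) if i not in chosen]

print("perturbed indices:", chosen)
print("updated parameters:", new_params)
```

Restricting the noise to the selected indices is what makes the update "surgical": every other parameter is left bit-for-bit unchanged, which is also why the quality of the saliency scores matters so much for the deletion claim.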

If this is right

  • Multimodal models maintain strong predictive performance when one or more modalities are missing by using the generator-based reconstruction.
  • Deletion requests can be fulfilled through targeted parameter changes without requiring full model retraining.
  • The resulting certificate supplies machine-verifiable proof that modality-specific signals have been removed.
  • A practical privacy-utility balance is achieved on standard benchmark datasets for sentiment analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the certificate holds under scrutiny, similar surgical updates could be adopted for other multimodal tasks where selective removal of input types is required.
  • Regulatory frameworks might eventually mandate such certified deletion mechanisms for handling user requests in deployed multimodal systems.
  • Robustness tests against reconstruction attacks on the updated parameters could be run to check for any undetected residual modality information.

Load-bearing premise

That saliency-driven candidate selection followed by a calibrated Gaussian update produces a machine-verifiable certificate that actually removes all modality-specific information without hidden leakage.

What would settle it

An experiment that trains a separate recovery model on the updated parameters and measures whether it can still predict information from the deleted modality above chance level on held-out test data.
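
A probe test of this kind can be sketched under simplifying assumptions. Below, a nearest-centroid classifier stands in for the recovery model, synthetic Gaussian features stand in for representations taken from the updated model, and a one-sided binomial test compares held-out accuracy with chance. The helper names, the 0.5 chance level, and the synthetic data are illustrative assumptions, not part of the paper.

```python
import math
import random

def fit_centroids(features, labels):
    """Mean feature vector per class (a nearest-centroid probe)."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, x):
    """Label of the closest centroid in squared Euclidean distance."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def binomial_p_value(successes, n, chance):
    """One-sided P(X >= successes) for X ~ Binomial(n, chance)."""
    return sum(math.comb(n, k) * chance ** k * (1 - chance) ** (n - k)
               for k in range(successes, n + 1))

def probe_leakage(train, test, chance=0.5):
    """Fit the probe on train, return held-out accuracy and p-value against chance."""
    centroids = fit_centroids(*train)
    hits = sum(predict(centroids, x) == y for x, y in zip(*test))
    n = len(test[1])
    return hits / n, binomial_p_value(hits, n, chance)

def make_split(n, shift, rng, dim=4):
    """Synthetic features; shift > 0 makes them depend on the deleted-modality label."""
    labels = [i % 2 for i in range(n)]
    features = [[rng.gauss(shift * y, 1.0) for _ in range(dim)] for y in labels]
    return features, labels

rng = random.Random(0)
leaky_acc, leaky_p = probe_leakage(make_split(400, 2.0, rng), make_split(200, 2.0, rng))
clean_acc, clean_p = probe_leakage(make_split(400, 0.0, rng), make_split(200, 0.0, rng))
print(f"leaky features: acc={leaky_acc:.3f}, p={leaky_p:.3g}")
print(f"clean features: acc={clean_acc:.3f}, p={clean_p:.3g}")
```

A small p-value on real post-deletion features would indicate residual modality information. A large p-value would only show that this particular probe failed, so stronger probes would still be needed before treating the deletion as complete.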

Figures

Figures reproduced from arXiv: 2602.16144 by Chunlei Meng, Hao Zhang, Jiaxuan Lu, Jiekai Wu, Kangan Qian, Rong Fu, Simon Fong, Ziming Wang.

Figure 1: Overview of the Missing-by-Design (MBD) framework for certifiable modality deletion. The architecture …
Figure 2: Privacy–utility trade-off after certified audio deletion. Plotted curves show binary accuracy (Acc2) together …
Figure 3: Training trajectories for the principal loss terms (averaged across three seeds).
Figure 4: t-SNE visualization of reconstructed embeddings (left: without property embedding pathway; right: with …
Figure 5: Cumulative privacy budget under sequential modality deletions. The solid curve shows the theoretical …
Figure 6: SwiftPrune proxy L̂_q versus the true leave-one-out increment ΔL_q. Each point corresponds to a candidate parameter q. The dashed line is y = x and the shaded band indicates ±8% around y = x. Spearman ρ = 0.87 and 98% of points fall inside the ±8% band, which visually confirms that the proxy tracks the true increments closely and does not systematically over-estimate them.
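
The two agreement statistics quoted for Figure 6 (a Spearman rank correlation and the share of points inside a ±8% band around y = x) can be computed as in the sketch below. The toy proxy and true values are made up for illustration, and reading the band as a relative tolerance, |proxy − true| ≤ 0.08·|true|, is an assumption about how the paper defines it.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rho: the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def fraction_in_band(proxy, true, rel_tol=0.08):
    """Share of points with |proxy - true| <= rel_tol * |true| (assumed band definition)."""
    inside = sum(abs(p - t) <= rel_tol * abs(t) for p, t in zip(proxy, true))
    return inside / len(true)

# Hypothetical values standing in for the proxy and the true leave-one-out increments.
true_increments = [1.0, 2.0, 3.0, 4.0, 5.0]
proxy_values = [1.05, 1.9, 3.1, 4.5, 5.2]

rho = spearman(proxy_values, true_increments)
in_band = fraction_in_band(proxy_values, true_increments)
print(f"Spearman rho = {rho:.2f}, fraction inside band = {in_band:.2f}")
```

Rank correlation only checks that the proxy orders candidates like the true increments do; the band fraction additionally checks magnitude, which is why the caption reports both.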
Abstract

As multimodal systems increasingly process sensitive personal data, the ability to selectively revoke specific data modalities has become a critical requirement for privacy compliance and user autonomy. We present Missing-by-Design (MBD), a unified framework for revocable multimodal sentiment analysis that combines structured representation learning with a certifiable parameter-modification pipeline. Revocability is critical in privacy-sensitive applications where users or regulators may request removal of modality-specific information. MBD learns property-aware embeddings and employs generator-based reconstruction to recover missing channels while preserving task-relevant signals. For deletion requests, the framework applies saliency-driven candidate selection and a calibrated Gaussian update to produce a machine-verifiable Modality Deletion Certificate. Experiments on benchmark datasets show that MBD achieves strong predictive performance under incomplete inputs and delivers a practical privacy-utility trade-off, positioning surgical unlearning as an efficient alternative to full retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Missing-by-Design (MBD), a unified framework for revocable multimodal sentiment analysis. It combines property-aware embeddings and generator-based reconstruction to handle missing modalities while preserving task signals, and for deletion requests applies saliency-driven candidate selection followed by a calibrated Gaussian parameter update to generate a machine-verifiable Modality Deletion Certificate. Experiments on benchmark datasets are claimed to show strong predictive performance under incomplete inputs together with a practical privacy-utility trade-off, positioning the method as an efficient surgical-unlearning alternative to full retraining.

Significance. If the Modality Deletion Certificate can be shown to eliminate modality-specific information without residual leakage, the framework would supply a concrete, efficient mechanism for user-driven modality revocation in privacy-sensitive multimodal systems. This would be a meaningful contribution to certifiable unlearning, especially for applications such as sentiment analysis that process entangled personal data. The reported efficiency gains over retraining would be a clear practical advantage once the soundness claims are substantiated.

major comments (3)
  1. Abstract: The assertion that the calibrated Gaussian update produces a 'machine-verifiable Modality Deletion Certificate' is unsupported by any equations, proof sketch, or bound on residual mutual information; without such a derivation the certifiability claim cannot be evaluated.
  2. Method section (saliency-driven candidate selection and Gaussian update): No argument is supplied showing that the update eliminates cross-modal correlations typical in sentiment analysis; the procedure may leave predictive information in the retained embedding space that the certificate does not detect.
  3. Experiments section: No quantitative results on certificate soundness (e.g., post-deletion mutual-information estimates, modality-specific probe accuracy, or leakage metrics) are reported, leaving the central privacy guarantee unverified.
minor comments (2)
  1. Abstract: The acronym 'MBD' is introduced without an explicit expansion on first use.
  2. Notation: The term 'property-aware embeddings' is used without a formal definition or reference to the precise loss terms that enforce the property.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which help us improve the clarity and rigor of our work on Missing-by-Design. We address each major comment below, proposing revisions to substantiate the certifiability claims.

  1. Referee: Abstract: The assertion that the calibrated Gaussian update produces a 'machine-verifiable Modality Deletion Certificate' is unsupported by any equations, proof sketch, or bound on residual mutual information; without such a derivation the certifiability claim cannot be evaluated.

    Authors: We agree that the abstract's claim requires stronger theoretical support. In the revised manuscript, we will expand the Method section with a formal derivation of the Modality Deletion Certificate, including a proof sketch that the calibrated Gaussian update bounds the residual mutual information between the deleted modality and the model parameters to a negligible level. The certificate then becomes machine-verifiable by checking the update parameters. revision: yes

  2. Referee: Method section (saliency-driven candidate selection and Gaussian update): No argument is supplied showing that the update eliminates cross-modal correlations typical in sentiment analysis; the procedure may leave predictive information in the retained embedding space that the certificate does not detect.

    Authors: The saliency-driven candidate selection identifies parameters with high influence on modality-specific predictions, and the subsequent Gaussian update is designed to perturb these parameters in a way that disrupts cross-modal correlations. While we believe the procedure achieves this based on the property-aware embeddings, we acknowledge the lack of an explicit argument. We will add a theoretical analysis in the revision demonstrating that the update reduces cross-modal mutual information, with the certificate serving as verification of this reduction. revision: partial

  3. Referee: Experiments section: No quantitative results on certificate soundness (e.g., post-deletion mutual-information estimates, modality-specific probe accuracy, or leakage metrics) are reported, leaving the central privacy guarantee unverified.

    Authors: We concur that empirical evidence for the certificate's soundness is essential. The current experiments focus on predictive performance and efficiency, but we will include additional results in the revised version, such as post-deletion mutual information estimates between modalities and probe classifier accuracies for modality-specific information, to quantify the leakage and validate the privacy guarantees. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the MBD derivation chain

The paper describes a framework that learns property-aware embeddings, uses generator-based reconstruction for missing channels, and applies saliency-driven candidate selection plus calibrated Gaussian update to generate a Modality Deletion Certificate. No equations or steps in the provided description reduce the certificate or the claimed removal of modality-specific information to a quantity defined by the same fitted parameters or by self-citation chains that bear the central load. The privacy-utility claims rest on experimental results on benchmark datasets rather than self-referential definitions or imported uniqueness theorems, leaving the derivation self-contained against external validation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework assumes standard properties of Gaussian perturbations and saliency maps; no new physical constants or ad-hoc entities beyond the certificate itself are introduced in the abstract.

free parameters (1)
  • Gaussian calibration parameter
    The abstract mentions a 'calibrated Gaussian update' whose scale must be chosen to balance deletion strength and task performance.
axioms (1)
  • domain assumption: Saliency scores accurately identify parameters carrying modality-specific information
    Invoked when selecting candidates for the deletion update.
invented entities (1)
  • Modality Deletion Certificate (no independent evidence)
    purpose: Machine-verifiable proof that a chosen modality has been removed
    New ledger entry whose validity is asserted but not derived in the abstract.
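
To make the "Gaussian calibration parameter" entry concrete, the sketch below shows one standard way such a scale is chosen: the classical Gaussian mechanism from differential privacy, with basic sequential composition for repeated deletions. This is a swapped-in textbook rule used purely for illustration. The paper's own calibration is not given on this page, and the sensitivity, epsilon, and delta values are hypothetical.

```python
import math

def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism (stated for 0 < epsilon < 1)."""
    if not 0 < epsilon < 1:
        raise ValueError("the classical bound is stated for 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

def basic_composition(budgets):
    """Total (epsilon, delta) under basic sequential composition: both terms add up."""
    return sum(e for e, _ in budgets), sum(d for _, d in budgets)

# Hypothetical request: sensitivity 1.0 with a per-deletion budget of (0.5, 1e-5).
sigma = gaussian_sigma(sensitivity=1.0, epsilon=0.5, delta=1e-5)

# Three sequential modality deletions, each spending the same budget.
total_epsilon, total_delta = basic_composition([(0.5, 1e-5)] * 3)

print(f"sigma = {sigma:.3f}")
print(f"cumulative budget after 3 deletions: epsilon = {total_epsilon}, delta = {total_delta:.1e}")
```

Under this rule a smaller epsilon (stronger deletion guarantee) forces a larger sigma and therefore more damage to task accuracy, which is the trade-off the ledger entry describes. Tighter composition bounds exist and would give a smaller cumulative budget than the simple sum used here.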

pith-pipeline@v0.9.0 · 5467 in / 1274 out tokens · 13895 ms · 2026-05-15T21:50:06.789945+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer

    cs.CL · 2026-05 · unverdicted · novelty 5.0

    EGAD adaptively distills LLM knowledge at the token level by using entropy to create a curriculum from low- to high-entropy tokens, adjust temperature, and switch between logits-only and feature-based branches.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 1 Pith paper

  1. [1]

    A systematic literature review on incomplete multimodal learning: techniques and challenges.Systems Science & Control Engineering, 13(1): 2467083, 2025

    Yifan Zhan, Rui Yang, Junxian You, Mengjie Huang, Weibo Liu, and Xiaohui Liu. A systematic literature review on incomplete multimodal learning: techniques and challenges.Systems Science & Control Engineering, 13(1): 2467083, 2025

  2. [2]

    Found in translation: Learning robust joint representations by cyclic translations between modalities

    Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabás Póczos. Found in translation: Learning robust joint representations by cyclic translations between modalities. InProceedings of the AAAI conference on artificial intelligence, volume 33, pages 6892–6899, 2019

  3. [3]

    Multimodal and multi-view models for emotion recognition

    Gustavo Aguilar, Viktor Rozgic, Weiran Wang, and Chao Wang. Multimodal and multi-view models for emotion recognition. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 991–1002, 2019

  4. [4]

    Enhancing sentence representation with visually-supervised multimodal pre-training

    Zhe Li, Laurence T Yang, Xin Nie, BoCheng Ren, and Xianjun Deng. Enhancing sentence representation with visually-supervised multimodal pre-training. InProceedings of the 31st ACM International Conference on Multimedia, pages 5686–5695, 2023

  5. [5]

    Incomplete multimodality-diffused emotion recognition.Advances in Neural Information Processing Systems, 36:17117–17128, 2023

    Yuanzhi Wang, Yong Li, and Zhen Cui. Incomplete multimodality-diffused emotion recognition.Advances in Neural Information Processing Systems, 36:17117–17128, 2023

  6. [6]

    Gcnet: Graph completion network for incomplete multimodal learning in conversation.IEEE Transactions on pattern analysis and machine intelligence, 45(7): 8419–8432, 2023

    Zheng Lian, Lan Chen, Licai Sun, Bin Liu, and Jianhua Tao. Gcnet: Graph completion network for incomplete multimodal learning in conversation.IEEE Transactions on pattern analysis and machine intelligence, 45(7): 8419–8432, 2023

  7. [7]

    Grmi: Graph representation learning of multimodal data with incompleteness

    Xian Xu, Xiao Xu, Xiang Li, and Guotong Xie. Grmi: Graph representation learning of multimodal data with incompleteness. InInternational Conference on Database Systems for Advanced Applications, pages 286–296. Springer, 2023

  8. [8]

    Ada2i: Enhancing modality balance for multimodal conversational emotion recognition

    Cam-Van Thi Nguyen, The-Son Le, Anh-Tuan Mai, and Duc-Trong Le. Ada2i: Enhancing modality balance for multimodal conversational emotion recognition. InProceedings of the 32nd ACM International Conference on Multimedia, pages 9330–9339, 2024

  9. [9]

    Patient-centered and practical privacy to support ai for healthcare

    Ruixuan Liu, Hong Kyu Lee, Sivasubramanium V Bhavani, Xiaoqian Jiang, Lucila Ohno-Machado, and Li Xiong. Patient-centered and practical privacy to support ai for healthcare. In2024 IEEE 6th International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications (TPS-ISA), pages 265–272. IEEE, 2024

  10. [10]

    A survey on security and privacy of large multimodal deep learning models: Teaching and learning perspective

    Md Abdur Rahman, Lamyaa Alqahtani, Amna Albooq, and Alaa Ainousah. A survey on security and privacy of large multimodal deep learning models: Teaching and learning perspective. In2024 21st Learning and Technology Conference (L&T), pages 13–18. IEEE, 2024

  11. [11]

    Privacy protection in deep multi-modal retrieval

    Peng-Fei Zhang, Yang Li, Zi Huang, and Hongzhi Yin. Privacy protection in deep multi-modal retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 634–643, 2021

  12. [12]

    Affective computing and emotional data: Challenges and implications in privacy regulations, the ai act, and ethics in large language models.arXiv preprint arXiv:2509.20153, 2025

    Nicola Fabiano. Affective computing and emotional data: Challenges and implications in privacy regulations, the ai act, and ethics in large language models.arXiv preprint arXiv:2509.20153, 2025

  13. [13]

    Unimse: Towards unified multimodal sentiment analysis and emotion recognition.arXiv preprint arXiv:2211.11256, 2022

    Guimin Hu, Ting-En Lin, Yi Zhao, Guangming Lu, Yuchuan Wu, and Yongbin Li. Unimse: Towards unified multimodal sentiment analysis and emotion recognition.arXiv preprint arXiv:2211.11256, 2022

  14. [14]

    Multi-label multimodal emotion recognition with transformer-based fusion and emotion-level representation learning.Ieee Access, 11: 14742–14751, 2023

    Hoai-Duy Le, Guee-Sang Lee, Soo-Hyung Kim, Seungwon Kim, and Hyung-Jeong Yang. Multi-label multimodal emotion recognition with transformer-based fusion and emotion-level representation learning.Ieee Access, 11: 14742–14751, 2023

  15. [15]

    Memo- cmt: multimodal emotion recognition using cross-modal transformer-based feature fusion.Scientific reports, 15 (1):5473, 2025

    Mustaqeem Khan, Phuong-Nam Tran, Nhat Truong Pham, Abdulmotaleb El Saddik, and Alice Othmani. Memo- cmt: multimodal emotion recognition using cross-modal transformer-based feature fusion.Scientific reports, 15 (1):5473, 2025. 13 Missing-by-Design

  16. [16]

    Cross-modal fine-grained alignment and fusion network for multimodal aspect-based sentiment analysis.Information Processing & Management, 60 (6):103508, 2023

    Luwei Xiao, Xingjiao Wu, Shuwen Yang, Junjie Xu, Jie Zhou, and Liang He. Cross-modal fine-grained alignment and fusion network for multimodal aspect-based sentiment analysis.Information Processing & Management, 60 (6):103508, 2023

  17. [17]

    Pmr: Prototypical modal rebalance for multimodal learning

    Yunfeng Fan, Wenchao Xu, Haozhao Wang, Junxiao Wang, and Song Guo. Pmr: Prototypical modal rebalance for multimodal learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20029–20038, 2023

  18. [18]

    Enhanced experts with uncertainty- aware routing for multimodal sentiment analysis

    Zixian Gao, Disen Hu, Xun Jiang, Huimin Lu, Heng Tao Shen, and Xing Xu. Enhanced experts with uncertainty- aware routing for multimodal sentiment analysis. InProceedings of the 32nd ACM International Conference on Multimedia, pages 9650–9659, 2024

  19. [19]

    Tmdc: A two-stage modality denoising and complementation framework for multimodal sentiment analysis with missing and noisy modalities.arXiv preprint arXiv:2511.10325, 2025

    Yan Zhuang, Minhao Liu, Yanru Zhang, Jiawen Deng, and Fuji Ren. Tmdc: A two-stage modality denoising and complementation framework for multimodal sentiment analysis with missing and noisy modalities.arXiv preprint arXiv:2511.10325, 2025

  20. [20]

    Msaf-cf: A multimodal sentiment analysis framework based on feature enhancement and cross-fusion.IEEE Access, 2025

    Zhongliang Wei, Ruofan Chen, and Jing Sun. Msaf-cf: A multimodal sentiment analysis framework based on feature enhancement and cross-fusion.IEEE Access, 2025

  21. [21]

    Meta-learning for incomplete multimodal sentiment analysis

    Geng Tu, Tianhao Wu, Xuan Luo, Xi Zeng, Wenjie Li, and Ruifeng Xu. Meta-learning for incomplete multimodal sentiment analysis. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2911–2915, 2025

  22. [22]

    Proxy-driven robust multimodal sentiment analysis with incomplete data

    Aoqiang Zhu, Min Hu, Xiaohua Wang, Jiaoyun Yang, Yiming Tang, and Ning An. Proxy-driven robust multimodal sentiment analysis with incomplete data. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22123–22138, 2025

  23. [23]

    A multimodal fusion network for student emotion recognition based on transformer and tensor product

    Ao Xiang, Zongqing Qi, Han Wang, Qin Yang, and Danqing Ma. A multimodal fusion network for student emotion recognition based on transformer and tensor product. In2024 IEEE 2nd International Conference on Sensors, Electronics and Computer Engineering (ICSECE), pages 1–4. IEEE, 2024

  24. [24]

    Learning from the global view: Supervised contrastive learning of multimodal representation.Information Fusion, 100:101920, 2023

    Sijie Mai, Ying Zeng, and Haifeng Hu. Learning from the global view: Supervised contrastive learning of multimodal representation.Information Fusion, 100:101920, 2023

  25. [25]

    Confede: Contrastive feature decomposition for multimodal sentiment analysis

    Jiuding Yang, Yakun Yu, Di Niu, Weidong Guo, and Yu Xu. Confede: Contrastive feature decomposition for multimodal sentiment analysis. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7617–7630, 2023

  26. [26]

    Disentanglement translation network for multimodal sentiment analysis.Information Fusion, 102:102031, 2024

    Ying Zeng, Wenjun Yan, Sijie Mai, and Haifeng Hu. Disentanglement translation network for multimodal sentiment analysis.Information Fusion, 102:102031, 2024

  27. [27]

    Rui Liu, Haolin Zuo, Zheng Lian, Björn W Schuller, and Haizhou Li. Contrastive learning based modality-invariant feature acquisition for robust multimodal emotion recognition with missing modalities.IEEE Transactions on Affective Computing, 15(4):1856–1873, 2024

  28. [28]

    Multimodal sentiment analysis with unimodal label generation and modality decomposition.Information Fusion, 116:102787, 2025

    Linan Zhu, Hongyan Zhao, Zhechao Zhu, Chenwei Zhang, and Xiangjie Kong. Multimodal sentiment analysis with unimodal label generation and modality decomposition.Information Fusion, 116:102787, 2025

  29. [29]

    Hessian-Free Online Certified Unlearn- ing, February 2025

    Xinbao Qiao, Meng Zhang, Ming Tang, and Ermin Wei. Hessian-free online certified unlearning.arXiv preprint arXiv:2404.01712, 2024

  30. [30]

    Single image unlearning: Efficient machine unlearning in multimodal large language models.Advances in Neural Information Processing Systems, 37:35414–35453, 2024

    Jiaqi Li, Qianshan Wei, Chuanyi Zhang, Guilin Qi, Miaozeng Du, Yongrui Chen, Sheng Bi, and Fan Liu. Single image unlearning: Efficient machine unlearning in multimodal large language models.Advances in Neural Information Processing Systems, 37:35414–35453, 2024

  31. [31]

    Multidelete for multimodal machine unlearning

    Jiali Cheng and Hadi Amiri. Multidelete for multimodal machine unlearning. InEuropean Conference on Computer Vision, pages 165–184. Springer, 2024

  32. [32]

    Certified minimax unlearning with generalization rates and deletion capacity.Advances in Neural Information Processing Systems, 36:62821–62852, 2023

    Jiaqi Liu, Jian Lou, Zhan Qin, and Kui Ren. Certified minimax unlearning with generalization rates and deletion capacity.Advances in Neural Information Processing Systems, 36:62821–62852, 2023

  33. [33]

    Gaussian certified unlearning in high dimensions: A hypothesis testing approach.arXiv preprint arXiv:2510.13094, 2025

    Aaradhya Pandey, Arnab Auddy, Haolin Zou, Arian Maleki, and Sanjeev Kulkarni. Gaussian certified unlearning in high dimensions: A hypothesis testing approach.arXiv preprint arXiv:2510.13094, 2025. 14 Missing-by-Design

  34. [34]

    Modality-aware neuron pruning for unlearning in multimodal large language models.arXiv preprint arXiv:2502.15910, 2025

    Zheyuan Liu, Guangyao Dou, Xiangchi Yuan, Chunhui Zhang, Zhaoxuan Tan, and Meng Jiang. Modality-aware neuron pruning for unlearning in multimodal large language models.arXiv preprint arXiv:2502.15910, 2025

  35. [35]

    Protecting privacy in multimodal large language models with mllmu-bench

    Zheyuan Liu, Guangyao Dou, Mengzhao Jia, Zhaoxuan Tan, Qingkai Zeng, Yongle Yuan, and Meng Jiang. Protecting privacy in multimodal large language models with mllmu-bench. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4...

  36. [36]

    Practical membership inference attacks against large-scale multi-modal models: A pilot study

    Myeongseob Ko, Ming Jin, Chenguang Wang, and Ruoxi Jia. Practical membership inference attacks against large-scale multi-modal models: A pilot study. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4871–4881, 2023

  37. [37]

    Black-box adversarial attack on vision language models for autonomous driving.arXiv preprint arXiv:2501.13563, 2025

    Lu Wang, Tianyuan Zhang, Yang Qu, Siyuan Liang, Yuwei Chen, Aishan Liu, Xianglong Liu, and Dacheng Tao. Black-box adversarial attack on vision language models for autonomous driving.arXiv preprint arXiv:2501.13563, 2025

  38. [38]

    Can textual unlearning solve cross-modality safety alignment? InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 9830–9844, 2024

    Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael B Abu-Ghazaleh, M Salman Asif, Yue Dong, Amit Roy-Chowdhury, and Chengyu Song. Can textual unlearning solve cross-modality safety alignment? InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 9830–9844, 2024

  39. [39]

    Towards benign memory forgetting for selective multimodal large language model unlearning.arXiv preprint arXiv:2511.20196, 2025

    Zhen Zeng, Leijiang Gu, Zhangling Duan, Feng Li, Zenglin Shi, Cees GM Snoek, and Meng Wang. Towards benign memory forgetting for selective multimodal large language model unlearning.arXiv preprint arXiv:2511.20196, 2025

  40. [40]

    User-controlled privacy: Taint, track, and control.Proceedings on Privacy Enhancing Technologies, 2024

    François Hublet, David Basin, and Sr ¯dan Krsti´c. User-controlled privacy: Taint, track, and control.Proceedings on Privacy Enhancing Technologies, 2024

  41. [41]

    Cross-modal privacy-preserving synthesis and mixture-of-experts ensemble for robust asd prediction.Frontiers in Neuroinformatics, 19:1679196, 2025

    J Revathy and Karthiga M. Cross-modal privacy-preserving synthesis and mixture-of-experts ensemble for robust asd prediction.Frontiers in Neuroinformatics, 19:1679196, 2025

  42. [42]

    Privacy-preserving multimodal sentiment analysis.IEEE Internet of Things Journal, 2025

    Honghui Xu, Wei Li, Daniel Takabi, Daehee Seo, and Zhipeng Cai. Privacy-preserving multimodal sentiment analysis.IEEE Internet of Things Journal, 2025

  43. [43]

    Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages.IEEE Intelligent Systems, 31(6):82–88, 2016

    Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages.IEEE Intelligent Systems, 31(6):82–88, 2016

  44. [44]

    Memory fusion network for multi-view sequential learning

    Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Memory fusion network for multi-view sequential learning. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  45. [45]

    Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis.IEEE Transactions on Affective Computing, 14(3):2276–2289, 2022

    Sijie Mai, Ying Zeng, Shuangjia Zheng, and Haifeng Hu. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis.IEEE Transactions on Affective Computing, 14(3):2276–2289, 2022

  46. [46]

    Hydiscgan: A hybrid distributed cgan for audio-visual privacy preservation in multimodal sentiment analysis.arXiv preprint arXiv:2404.11938, 2024

    Zhuojia Wu, Qi Zhang, Duoqian Miao, Kun Yi, Wei Fan, and Liang Hu. Hydiscgan: A hybrid distributed cgan for audio-visual privacy preservation in multimodal sentiment analysis.arXiv preprint arXiv:2404.11938, 2024

  47. [47]

    Clgsi: a multimodal sentiment analysis framework based on contrastive learning guided by sentiment intensity

    Yang Yang, Xunde Dong, and Yupeng Qiang. Clgsi: a multimodal sentiment analysis framework based on contrastive learning guided by sentiment intensity. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 2099–2110, 2024

  48. [48]

    Dlf: Disentangled-language-focused multimodal sentiment analysis

    Pan Wang, Qiang Zhou, Yawen Wu, Tianlong Chen, and Jingtong Hu. Dlf: Disentangled-language-focused multimodal sentiment analysis. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 21180–21188, 2025

  49. [49]

    Pamoe-msa: polarity-aware mixture of experts network for multimodal sentiment analysis.International Journal of Multimedia Information Retrieval, 14(1):1–16, 2025

    Changqin Huang, Zhenheng Lin, Zhongmei Han, Qionghao Huang, Fan Jiang, and Xiaodi Huang. Pamoe-msa: polarity-aware mixture of experts network for multimodal sentiment analysis.International Journal of Multimedia Information Retrieval, 14(1):1–16, 2025

  50. [50]

    Msamba: Exploring multimodal sentiment analysis with state space models

    Xilin He, Haijian Liang, Boyi Peng, Weicheng Xie, Muhammad Haris Khan, Siyang Song, and Zitong Yu. Msamba: Exploring multimodal sentiment analysis with state space models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 1309–1317, 2025

  51. [51]

    Iemocap: Interactive emotional dyadic motion capture database

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335–359, 2008

  52. [52]

    Two-stage finetuning of wav2vec 2.0 for speech emotion recognition with asr and gender pretraining

    Yuan Gao, Chenhui Chu, and Tatsuya Kawahara. Two-stage finetuning of wav2vec 2.0 for speech emotion recognition with asr and gender pretraining. In Proc. Interspeech, pages 3637–3641, 2023

  53. [53]

    Learning robust self-attention features for speech emotion recognition with label-adaptive mixup

    Lei Kang, Lichao Zhang, and Dazhi Jiang. Learning robust self-attention features for speech emotion recognition with label-adaptive mixup. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

  54. [54]

    Improving speech emotion recognition with unsupervised speaking style transfer

    Leyuan Qu, Wei Wang, Cornelius Weber, Pengcheng Yue, Taihao Li, and Stefan Wermter. Improving speech emotion recognition with unsupervised speaking style transfer. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 10101–10105. IEEE, 2024

  55. [55]

    Leveraging knowledge of modality experts for incomplete multimodal learning

    Wenxin Xu, Hexin Jiang, and Xuefeng Liang. Leveraging knowledge of modality experts for incomplete multimodal learning. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 438–446, 2024

  56. [56]

    Apin: Amplitude-and phase-aware interaction network for speech emotion recognition

    Lili Guo, Jie Li, Shifei Ding, and Jianwu Dang. Apin: Amplitude-and phase-aware interaction network for speech emotion recognition. Speech Communication, 169:103201, 2025

  57. [57]

    Individual-aware attention modulation for unseen speaker emotion recognition

    Yuanbo Fang, Xiaofen Xing, Zhaojie Chu, Yifeng Du, and Xiangmin Xu. Individual-aware attention modulation for unseen speaker emotion recognition. IEEE Transactions on Affective Computing, 2024

  58. [58]

    Gatem 2 former: Gated feature selection and expert modeling in multimodal emotion recognition

    Weixiang Xu, Zhongren Dong, Runming Wang, Xinzhou Xu, and Zixing Zhang. Gatem 2 former: Gated feature selection and expert modeling in multimodal emotion recognition. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

  59. [59]

    Seenet: A soft emotion expert and data augmentation method to enhance speech emotion recognition

    Qifei Li, Yingming Gao, Yuhua Wen, Ziping Zhao, Ya Li, and Björn W Schuller. Seenet: A soft emotion expert and data augmentation method to enhance speech emotion recognition. IEEE Transactions on Affective Computing, 2025

  60. [60]

    Towards robust multimodal sentiment analysis with incomplete data

    Haoyu Zhang, Wenbin Wang, and Tianshu Yu. Towards robust multimodal sentiment analysis with incomplete data. Advances in Neural Information Processing Systems, 37:55943–55974, 2024

  61. [61]

    Cider: Consensus-based image description evaluation

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015

A Proofs and calibration details

This appendix collects the full derivation of the DP-like indistinguishability bound used in the paper, supp...