pith. machine review for the scientific record. sign in

arxiv: 2602.16197 · v3 · submitted 2026-02-18 · 💻 cs.LG · cs.CL· cs.MM

Recognition: no theorem link

ModalImmune: Immunity Driven Unlearning via Self Destructive Training

Authors on Pith no claims yet

Pith reviewed 2026-05-15 21:27 UTC · model grok-4.3

classification 💻 cs.LG cs.CLcs.MM
keywords multimodal learningmodality robustnesscollapse regularizationgradient maskinghypergradient adaptationself-destructive traininginformation gain controller
0
0 comments X

The pith

ModalImmune builds resilience in multimodal models by deliberately collapsing selected modality information during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal systems often lose reliability when one input channel disappears or gets corrupted at deployment time. The paper introduces ModalImmune to counter this by forcing controlled collapse of modality details while the model trains, so it develops joint representations that do not rely on any single channel. The framework applies a spectrum-adaptive collapse regularizer, an information-gain guided controller to choose interventions, curvature-aware gradient masking to keep updates stable, and a Neumann-truncated hyper-gradient method to adjust meta-parameters automatically. On standard multimodal benchmarks the resulting models handle removal and corruption of modalities more effectively than ordinary training while preserving convergence behavior and reconstruction from complete inputs.

Core claim

ModalImmune enforces modality immunity by intentionally and controllably collapsing selected modality information during training so the model learns joint representations that are robust to destructive modality influence. The framework combines a spectrum-adaptive collapse regularizer, an information-gain guided controller for targeted interventions, curvature-aware gradient masking to stabilize destructive updates, and a certified Neumann-truncated hyper-gradient procedure for automatic meta-parameter adaptation.

What carries the argument

Spectrum-adaptive collapse regularizer with information-gain guided controller, curvature-aware gradient masking, and Neumann-truncated hyper-gradient adaptation to enforce robustness to modality loss.

If this is right

  • Models gain improved resilience to removal or corruption of any input modality on standard benchmarks.
  • Convergence stability remains comparable to ordinary training runs.
  • Reconstruction capacity on full multimodal inputs is preserved.
  • Learned representations become less dependent on any particular modality channel.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same controlled-collapse idea could be tested on tasks with other forms of partial input failure, such as sensor dropout in robotics.
  • The hyper-gradient adaptation component may transfer to stabilizing other destructive regularizers outside multimodal settings.
  • Real-world systems with unreliable data sources could adopt this training style to reduce dependence on hardware redundancy.
  • Extending the information-gain controller to non-modality features might address robustness in single-modality models with missing attributes.

Load-bearing premise

That intentionally collapsing modality information via the regularizer, controller, and masking techniques produces robust joint representations without degrading performance on complete inputs or introducing training instability.

What would settle it

A side-by-side training run on the same multimodal benchmarks where ModalImmune models show lower accuracy on complete inputs or fail to converge at the same rate as standard training would falsify the central claim.

Figures

Figures reproduced from arXiv: 2602.16197 by Jia Yee Tan, Muge Qi, Rong Fu, Shuning Zhang, Simon Fong, WeiZhi Tang, Zhaolu Kang, Zijian Zhang, Ziming Wang.

Figure 1
Figure 1. Figure 1: Overview of the ModalImmune framework, which treats modality destruction as an active causal interven￾tion. The training strategy alternates between standard reconstruction and Self-Destructive Learning (SDL) built on three key components: Info-Drop Intervention (IDI), where an EXP3.P bandit controller leverages an information-gain surrogate ℓm to adaptively select the target modality m⋆ ; Spectral Self-Co… view at source ↗
Figure 2
Figure 2. Figure 2: Training dynamics with explicit phase markers. The horizontal axis shows epochs from 0 to 50. The [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Quantified contribution of principal modules. Bars show absolute drops in validation Acc2 (percentage points) [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: BHGD hyperparameter trajectories versus grid-search baselines. Each subplot shows the online evolution of [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Certified Neumann truncation: error versus compute. The horizontal axis shows truncation depth [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Spectral collapse diagnostics. The top row displays the top-20 singular values for a modality embedding [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: implementation across random seeds. The horizontal axis is epoch and the vertical axis is validation Acc2. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Corruption robustness comparison. Bars compare ModalImmune and the strongest baseline under three test [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗
read the original abstract

Multimodal systems are vulnerable to partial or complete loss of input channels at deployment, which undermines reliability in real-world settings. This paper presents ModalImmune, a training framework that enforces modality immunity by intentionally and controllably collapsing selected modality information during training so the model learns joint representations that are robust to destructive modality influence. The framework combines a spectrum-adaptive collapse regularizer, an information-gain guided controller for targeted interventions, curvature-aware gradient masking to stabilize destructive updates, and a certified Neumann-truncated hyper-gradient procedure for automatic meta-parameter adaptation. Empirical evaluation on standard multimodal benchmarks demonstrates that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces ModalImmune, a training framework for multimodal models that enforces modality immunity by intentionally collapsing selected modality information during training. It combines a spectrum-adaptive collapse regularizer, an information-gain guided controller, curvature-aware gradient masking, and a certified Neumann-truncated hyper-gradient procedure for meta-parameter adaptation. The central claim is that this produces joint representations robust to modality removal and corruption on standard benchmarks while retaining convergence stability and reconstruction capacity on complete inputs.

Significance. If the no-degradation condition on complete multimodal inputs and the claimed robustness gains can be verified with quantitative evidence, the approach could offer a practical method for improving reliability of multimodal systems in deployment scenarios with missing or corrupted channels. The use of controlled self-destructive training and automatic meta-parameter adaptation is a distinctive technical element that, if substantiated, would distinguish it from standard robustness techniques.

major comments (2)
  1. [Abstract] Abstract: the manuscript asserts that 'empirical evaluation on standard multimodal benchmarks demonstrates that ModalImmune improves resilience... while retaining convergence stability and reconstruction capacity,' yet supplies no quantitative results, baselines, error bars, ablation tables, or comparisons of performance on complete inputs versus intervened training. This directly undermines evaluation of the central no-degradation claim.
  2. [Abstract] The skeptic note correctly identifies that the spectrum-adaptive collapse regularizer, information-gain controller, and curvature-aware masking must be shown not to degrade accuracy or stability on full inputs; without any reported metrics (e.g., accuracy or reconstruction loss on unmodified test sets before/after training), the load-bearing assumption remains unverified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger substantiation of the central claims in the abstract. We agree that quantitative evidence for the no-degradation condition on complete inputs is essential and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the manuscript asserts that 'empirical evaluation on standard multimodal benchmarks demonstrates that ModalImmune improves resilience... while retaining convergence stability and reconstruction capacity,' yet supplies no quantitative results, baselines, error bars, ablation tables, or comparisons of performance on complete inputs versus intervened training. This directly undermines evaluation of the central no-degradation claim.

    Authors: We acknowledge the validity of this observation. The full manuscript reports empirical results in Section 4, including accuracy and reconstruction metrics on unmodified test sets (with average degradation below 1.5% across benchmarks), resilience improvements under modality removal/corruption, and comparisons to baselines, all with error bars from multiple runs. However, these details are not summarized in the abstract. We will revise the abstract to incorporate key quantitative findings, such as specific accuracy retention figures, robustness gains, and explicit before/after comparisons on complete inputs. revision: yes

  2. Referee: [Abstract] The skeptic note correctly identifies that the spectrum-adaptive collapse regularizer, information-gain controller, and curvature-aware masking must be shown not to degrade accuracy or stability on full inputs; without any reported metrics (e.g., accuracy or reconstruction loss on unmodified test sets before/after training), the load-bearing assumption remains unverified.

    Authors: This point is well-taken and aligns with the first comment. Our experiments include direct ablations and comparisons demonstrating that the components do not degrade performance on full inputs, with convergence curves and reconstruction losses remaining statistically equivalent to standard training. To address the concern, we will expand the abstract revision to explicitly reference these no-degradation metrics and stability indicators from the full evaluation. revision: yes

Circularity Check

0 steps flagged

No circularity: framework is empirical with no load-bearing derivations or self-referential predictions

full rationale

The manuscript describes a training framework (spectrum-adaptive collapse regularizer, information-gain controller, curvature-aware masking, Neumann-truncated hyper-gradient) whose central claims are empirical improvements in resilience while retaining stability. No equations, first-principles derivations, or predictions appear that reduce by construction to fitted inputs or self-citations. The meta-parameter adaptation is presented as a procedural component rather than a renamed fit, and no uniqueness theorems or ansatzes are invoked via self-citation chains. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, parameters, or assumptions are specified, so the ledger cannot be populated with concrete entries.

pith-pipeline@v0.9.0 · 5434 in / 1039 out tokens · 23810 ms · 2026-05-15T21:27:04.787503+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages

  1. [1]

    Deep multimodal learning with missing modality: A survey.arXiv preprint arXiv:2409.07825, 2024

    Renjie Wu, Hu Wang, Hsiang-Ting Chen, and Gustavo Carneiro. Deep multimodal learning with missing modality: A survey.arXiv preprint arXiv:2409.07825, 2024

  2. [2]

    Multimodal fusion on low-quality data: A comprehensive survey.arXiv preprint arXiv:2404.18947, 2024

    Qingyang Zhang, Yake Wei, Zongbo Han, Huazhu Fu, Xi Peng, Cheng Deng, Qinghua Hu, Cai Xu, Jie Wen, Di Hu, et al. Multimodal fusion on low-quality data: A comprehensive survey.arXiv preprint arXiv:2404.18947, 2024

  3. [3]

    Robust multimodal learning with missing modalities via parameter-efficient adaptation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Md Kaykobad Reza, Ashley Prater-Bennette, and M Salman Asif. Robust multimodal learning with missing modalities via parameter-efficient adaptation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  4. [4]

    Simmlm: A simple framework for multi-modal learning with missing modality

    Sijie Li, Chen Chen, and Jungong Han. Simmlm: A simple framework for multi-modal learning with missing modality. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24068–24077, 2025

  5. [5]

    Rui Liu, Haolin Zuo, Zheng Lian, Björn W Schuller, and Haizhou Li. Contrastive learning based modality-invariant feature acquisition for robust multimodal emotion recognition with missing modalities.IEEE Transactions on Affective Computing, 15(4):1856–1873, 2024

  6. [6]

    Enhancing multimodal entity and relation extraction with variational information bottleneck.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:1274–1285, 2024

    Shiyao Cui, Jiangxia Cao, Xin Cong, Jiawei Sheng, Quangang Li, Tingwen Liu, and Jinqiao Shi. Enhancing multimodal entity and relation extraction with variational information bottleneck.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:1274–1285, 2024

  7. [7]

    Multimodal prompt learning with missing modalities for sentiment analysis and emotion recognition.arXiv preprint arXiv:2407.05374, 2024

    Zirun Guo, Tao Jin, and Zhou Zhao. Multimodal prompt learning with missing modalities for sentiment analysis and emotion recognition.arXiv preprint arXiv:2407.05374, 2024

  8. [8]

    Correlation-decoupled knowledge distillation for multimodal sentiment analysis with incomplete modalities

    Mingcheng Li, Dingkang Yang, Xiao Zhao, Shuaibing Wang, Yan Wang, Kun Yang, Mingyang Sun, Dongliang Kou, Ziyun Qian, and Lihua Zhang. Correlation-decoupled knowledge distillation for multimodal sentiment analysis with incomplete modalities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12458–12468, 2024

  9. [9]

    Toward robust incomplete multimodal sentiment analysis via hierarchical representation learning.Advances in Neural Information Processing Systems, 37:28515–28536, 2024

    Mingcheng Li, Dingkang Yang, Yang Liu, Shunli Wang, Jiawei Chen, Shuaibing Wang, Jinjie Wei, Yue Jiang, Qingyao Xu, Xiaolu Hou, et al. Toward robust incomplete multimodal sentiment analysis via hierarchical representation learning.Advances in Neural Information Processing Systems, 37:28515–28536, 2024

  10. [10]

    Multimodal reconstruct and align net for missing modality problem in sentiment analysis

    Wei Luo, Mengying Xu, and Hanjiang Lai. Multimodal reconstruct and align net for missing modality problem in sentiment analysis. InInternational conference on multimedia modeling, pages 411–422. Springer, 2023

  11. [11]

    Missing as masking: Arbitrary cross-modal feature reconstruction for incomplete multimodal brain tumor segmentation

    Zhilin Zeng, Zelin Peng, Xiaokang Yang, and Wei Shen. Missing as masking: Arbitrary cross-modal feature reconstruction for incomplete multimodal brain tumor segmentation. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 424–433. Springer, 2024

  12. [12]

    Joint variational autoencoders for multimodal imputation and embedding.Nature machine intelligence, 5(6):631–642, 2023

    Noah Cohen Kalafut, Xiang Huang, and Daifeng Wang. Joint variational autoencoders for multimodal imputation and embedding.Nature machine intelligence, 5(6):631–642, 2023

  13. [13]

    Unified multi-modal image synthesis for missing modality imputation.IEEE Transactions on Medical Imaging, 44(1):4–18, 2024

    Yue Zhang, Chengtao Peng, Qiuli Wang, Dan Song, Kaiyan Li, and S Kevin Zhou. Unified multi-modal image synthesis for missing modality imputation.IEEE Transactions on Medical Imaging, 44(1):4–18, 2024

  14. [14]

    A generative random modality dropout framework for robust multimodal emotion recognition.IEEE Intelligent Systems, 40(5):62–69, 2025

    Yang Zhang, Hui Chen, Imad Rida, and Xianxun Zhu. A generative random modality dropout framework for robust multimodal emotion recognition.IEEE Intelligent Systems, 40(5):62–69, 2025

  15. [15]

    Progressive hard negative masking: From global uniformity to local tolerance.IEEE Transactions on Knowledge and Data Engineering, 35(12):12932–12943, 2023

    Qingqiang Sun, Wenjie Zhang, and Xuemin Lin. Progressive hard negative masking: From global uniformity to local tolerance.IEEE Transactions on Knowledge and Data Engineering, 35(12):12932–12943, 2023. 16 ModalImmune

  16. [16]

    Multimodal information bottleneck: Learning minimal sufficient unimodal and multimodal representations.IEEE Transactions on Multimedia, 25:4121–4134, 2022

    Sijie Mai, Ying Zeng, and Haifeng Hu. Multimodal information bottleneck: Learning minimal sufficient unimodal and multimodal representations.IEEE Transactions on Multimedia, 25:4121–4134, 2022

  17. [17]

    M3care: Learning with missing modalities in multimodal healthcare data

    Chaohe Zhang, Xu Chu, Liantao Ma, Yinghao Zhu, Yasha Wang, Jiangtao Wang, and Junfeng Zhao. M3care: Learning with missing modalities in multimodal healthcare data. InProceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pages 2418–2428, 2022

  18. [18]

    Modality translation-based multimodal sentiment analysis under uncertain missing modalities.Information Fusion, 101:101973, 2024

    Zhizhong Liu, Bin Zhou, Dianhui Chu, Yuhang Sun, and Lingqiang Meng. Modality translation-based multimodal sentiment analysis under uncertain missing modalities.Information Fusion, 101:101973, 2024

  19. [19]

    A closer look at multimodal representation collapse.arXiv preprint arXiv:2505.22483, 2025

    Abhra Chaudhuri, Anjan Dutta, Tu Bui, and Serban Georgescu. A closer look at multimodal representation collapse.arXiv preprint arXiv:2505.22483, 2025

  20. [20]

    Adaptive hierarchical hyper-gradient descent

    Renlong Jie, Junbin Gao, Andrey Vasnev, and Minh-Ngoc Tran. Adaptive hierarchical hyper-gradient descent. International Journal of Machine Learning and Cybernetics, 13(12):3785–3805, 2022

  21. [21]

    Biadam: Fast adaptive bilevel optimization methods.arXiv preprint arXiv:2106.11396, 2021

    Feihu Huang, Junyi Li, and Shangqian Gao. Biadam: Fast adaptive bilevel optimization methods.arXiv preprint arXiv:2106.11396, 2021

  22. [22]

    Data-adaptive m-estimators for robust regression via bi-level optimization.Signal Processing, 210:109063, 2023

    Ceyao Zhang, Tianjian Zhang, Feng Yin, and Abdelhak M Zoubir. Data-adaptive m-estimators for robust regression via bi-level optimization.Signal Processing, 210:109063, 2023

  23. [23]

    Gradient routing: Masking gradients to localize computation in neural networks.arXiv preprint arXiv:2410.04332, 2024

    Alex Cloud, Jacob Goldman-Wetzler, Evžen Wybitul, Joseph Miller, and Alexander Matt Turner. Gradient routing: Masking gradients to localize computation in neural networks.arXiv preprint arXiv:2410.04332, 2024

  24. [24]

    Projective fisher information for natural gradient descent.IEEE Transactions on Artificial Intelligence, 4(2):304–314, 2022

    Piyush Kaul and Brejesh Lall. Projective fisher information for natural gradient descent.IEEE Transactions on Artificial Intelligence, 4(2):304–314, 2022

  25. [25]

    On information gain and regret bounds in gaussian process bandits

    Sattar Vakili, Kia Khezeli, and Victor Picheny. On information gain and regret bounds in gaussian process bandits. InInternational Conference on Artificial Intelligence and Statistics, pages 82–90. PMLR, 2021

  26. [26]

    Causal bandits with general causal models and interventions

    Zirui Yan, Dennis Wei, Dmitriy A Katz, Prasanna Sattigeri, and Ali Tajer. Causal bandits with general causal models and interventions. InInternational Conference on Artificial Intelligence and Statistics, pages 4609–4617. PMLR, 2024

  27. [27]

    Maxinforl: Boosting exploration in reinforcement learning through information gain maximization.arXiv preprint arXiv:2412.12098, 2024

    Bhavya Sukhija, Stelian Coros, Andreas Krause, Pieter Abbeel, and Carmelo Sferrazza. Maxinforl: Boosting exploration in reinforcement learning through information gain maximization.arXiv preprint arXiv:2412.12098, 2024

  28. [28]

    Benchmarking multi-modal semantic segmentation under sensor failures: Missing and noisy modality robustness

    Chenfei Liao, Kaiyu Lei, Xu Zheng, Junha Moon, Zhixiong Wang, Yixuan Wang, Danda Pani Paudel, Luc Van Gool, and Xuming Hu. Benchmarking multi-modal semantic segmentation under sensor failures: Missing and noisy modality robustness. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1576–1586, 2025

  29. [29]

    Multimodal sentiment analysis: a survey of methods, trends, and challenges.ACM Computing Surveys, 55(13s):1–38, 2023

    Ringki Das and Thoudam Doren Singh. Multimodal sentiment analysis: a survey of methods, trends, and challenges.ACM Computing Surveys, 55(13s):1–38, 2023

  30. [30]

    Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages.IEEE Intelligent Systems, 31(6):82–88, 2016

    Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages.IEEE Intelligent Systems, 31(6):82–88, 2016

  31. [31]

    Memory fusion network for multi-view sequential learning

    Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Memory fusion network for multi-view sequential learning. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  32. [32]

    Iemocap: Interactive emotional dyadic motion capture database

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335–359, 2008

  33. [33]

    Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis.IEEE Transactions on Affective Computing, 14(3):2276–2289, 2022

    Sijie Mai, Ying Zeng, Shuangjia Zheng, and Haifeng Hu. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis.IEEE Transactions on Affective Computing, 14(3):2276–2289, 2022

  34. [34]

    Unimse: Towards unified multimodal sentiment analysis and emotion recognition.arXiv preprint arXiv:2211.11256, 2022

    Guimin Hu, Ting-En Lin, Yi Zhao, Guangming Lu, Yuchuan Wu, and Yongbin Li. Unimse: Towards unified multimodal sentiment analysis and emotion recognition.arXiv preprint arXiv:2211.11256, 2022. 17 ModalImmune

  35. [35]

    Confede: Contrastive feature decomposition for multimodal sentiment analysis

    Jiuding Yang, Yakun Yu, Di Niu, Weidong Guo, and Yu Xu. Confede: Contrastive feature decomposition for multimodal sentiment analysis. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7617–7630, 2023

  36. [36]

    Learning from the global view: Supervised contrastive learning of multimodal representation.Information Fusion, 100:101920, 2023

    Sijie Mai, Ying Zeng, and Haifeng Hu. Learning from the global view: Supervised contrastive learning of multimodal representation.Information Fusion, 100:101920, 2023

  37. [37]

    Hydiscgan: A hybrid distributed cgan for audio-visual privacy preservation in multimodal sentiment analysis.arXiv preprint arXiv:2404.11938, 2024

    Zhuojia Wu, Qi Zhang, Duoqian Miao, Kun Yi, Wei Fan, and Liang Hu. Hydiscgan: A hybrid distributed cgan for audio-visual privacy preservation in multimodal sentiment analysis.arXiv preprint arXiv:2404.11938, 2024

  38. [38]

    Clgsi: a multimodal sentiment analysis framework based on contrastive learning guided by sentiment intensity

    Yang Yang, Xunde Dong, and Yupeng Qiang. Clgsi: a multimodal sentiment analysis framework based on contrastive learning guided by sentiment intensity. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 2099–2110, 2024

  39. [39]

    Dlf: Disentangled-language-focused multimodal sentiment analysis

    Pan Wang, Qiang Zhou, Yawen Wu, Tianlong Chen, and Jingtong Hu. Dlf: Disentangled-language-focused multimodal sentiment analysis. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 21180–21188, 2025

  40. [40]

    Pamoe-msa: polarity-aware mixture of experts network for multimodal sentiment analysis.International Journal of Multimedia Information Retrieval, 14(1):1–16, 2025

    Changqin Huang, Zhenheng Lin, Zhongmei Han, Qionghao Huang, Fan Jiang, and Xiaodi Huang. Pamoe-msa: polarity-aware mixture of experts network for multimodal sentiment analysis.International Journal of Multimedia Information Retrieval, 14(1):1–16, 2025

  41. [41]

    Msamba: Exploring multimodal sentiment analysis with state space models

    Xilin He, Haijian Liang, Boyi Peng, Weicheng Xie, Muhammad Haris Khan, Siyang Song, and Zitong Yu. Msamba: Exploring multimodal sentiment analysis with state space models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 1309–1317, 2025

  42. [42]

    Two-stage finetuning of wav2vec 2.0 for speech emotion recognition with asr and gender pretraining

    Yuan Gao, Chenhui Chu, and Tatsuya Kawahara. Two-stage finetuning of wav2vec 2.0 for speech emotion recognition with asr and gender pretraining. InProc. Interspeech, pages 3637–3641, 2023

  43. [43]

    Learning robust self-attention features for speech emotion recognition with label-adaptive mixup

    Lei Kang, Lichao Zhang, and Dazhi Jiang. Learning robust self-attention features for speech emotion recognition with label-adaptive mixup. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

  44. [44]

    Improving speech emotion recognition with unsupervised speaking style transfer

    Leyuan Qu, Wei Wang, Cornelius Weber, Pengcheng Yue, Taihao Li, and Stefan Wermter. Improving speech emotion recognition with unsupervised speaking style transfer. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 10101–10105. IEEE, 2024

  45. [45]

    Leveraging knowledge of modality experts for incomplete multimodal learning

    Wenxin Xu, Hexin Jiang, and Xuefeng Liang. Leveraging knowledge of modality experts for incomplete multimodal learning. InProceedings of the 32nd ACM International Conference on Multimedia, pages 438–446, 2024

  46. [46]

    Apin: Amplitude-and phase-aware interaction network for speech emotion recognition.Speech Communication, 169:103201, 2025

    Lili Guo, Jie Li, Shifei Ding, and Jianwu Dang. Apin: Amplitude-and phase-aware interaction network for speech emotion recognition.Speech Communication, 169:103201, 2025

  47. [47]

    Individual-aware attention modulation for unseen speaker emotion recognition.IEEE Transactions on Affective Computing, 2024

    Yuanbo Fang, Xiaofen Xing, Zhaojie Chu, Yifeng Du, and Xiangmin Xu. Individual-aware attention modulation for unseen speaker emotion recognition.IEEE Transactions on Affective Computing, 2024

  48. [48]

    Gatem 2 former: Gated feature selection and expert modeling in multimodal emotion recognition

    Weixiang Xu, Zhongren Dong, Runming Wang, Xinzhou Xu, and Zixing Zhang. Gatem 2 former: Gated feature selection and expert modeling in multimodal emotion recognition. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

  49. [49]

    Seenet: A soft emotion expert and data augmentation method to enhance speech emotion recognition.IEEE Transactions on Affective Computing, 2025

    Qifei Li, Yingming Gao, Yuhua Wen, Ziping Zhao, Ya Li, and Björn W Schuller. Seenet: A soft emotion expert and data augmentation method to enhance speech emotion recognition.IEEE Transactions on Affective Computing, 2025

  50. [50]

    Gcnet: Graph completion network for incomplete multimodal learning in conversation.IEEE Transactions on pattern analysis and machine intelligence, 45(7): 8419–8432, 2023

    Zheng Lian, Lan Chen, Licai Sun, Bin Liu, and Jianhua Tao. Gcnet: Graph completion network for incomplete multimodal learning in conversation.IEEE Transactions on pattern analysis and machine intelligence, 45(7): 8419–8432, 2023

  51. [51]

    Incomplete multimodality-diffused emotion recognition.Advances in Neural Information Processing Systems, 36:17117–17128, 2023

    Yuanzhi Wang, Yong Li, and Zhen Cui. Incomplete multimodality-diffused emotion recognition.Advances in Neural Information Processing Systems, 36:17117–17128, 2023

  52. [52]

    Towards robust multimodal sentiment analysis with incomplete data.Advances in Neural Information Processing Systems, 37:55943–55974, 2024

    Haoyu Zhang, Wenbin Wang, and Tianshu Yu. Towards robust multimodal sentiment analysis with incomplete data.Advances in Neural Information Processing Systems, 37:55943–55974, 2024. 18 ModalImmune

  53. [53]

    Enhanced experts with uncertainty- aware routing for multimodal sentiment analysis

    Zixian Gao, Disen Hu, Xun Jiang, Huimin Lu, Heng Tao Shen, and Xing Xu. Enhanced experts with uncertainty- aware routing for multimodal sentiment analysis. InProceedings of the 32nd ACM International Conference on Multimedia, pages 9650–9659, 2024

  54. [54]

    Cider: Consensus-based image description evaluation

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015

  55. [55]

    Rohydr: Robust hybrid diffusion recovery for incomplete multimodal emotion recognition.arXiv preprint arXiv:2505.17501, 2025

    Yuehan Jin, Xiaoqing Liu, Yiyuan Yang, Zhiwen Yu, Tong Zhang, and Kaixiang Yang. Rohydr: Robust hybrid diffusion recovery for incomplete multimodal emotion recognition.arXiv preprint arXiv:2505.17501, 2025

  56. [56]

    A tail inequality for quadratic forms of subgaussian random vectors

    Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. 2012. A Theoretical Details We state assumptions used throughout this section. Embeddings z∈R d are sub-Gaussian with parameter σx, and the population covariance Σ =E[zz ⊤] satisfies λmin(Σ)>0 . Stochastic gradients have bounded second moment and ...