arXiv: 2604.06728 · v2 · submitted 2026-04-08 · 💻 cs.CV · cs.AI · cs.MM

URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection

Zhenyu Wang, Weichen Cheng, Weijia Li, Junjie Mou, Zongyou Zhao, Guoying Zhang

Pith reviewed 2026-05-10 19:04 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM

keywords multimodal sarcasm detection · uncertainty estimation · Gaussian posteriors · robust fusion · cross-modal attention · incongruity reasoning

The pith

By representing modalities as Gaussian distributions, URMF dynamically weights text and image inputs according to their estimated uncertainty to better detect sarcasm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces URMF, a framework that models text, image, and interaction-aware representations as learnable Gaussian posteriors to capture the uncertainty in each. It then uses these uncertainty estimates to adjust how much each modality contributes to the sarcasm prediction, rather than assuming they are equally reliable. This matters because social media posts vary in how noisy or irrelevant their text and images are, which can obscure the semantic incongruity that signals sarcasm. The method combines cross-attention for fusion with self-attention for incongruity reasoning, and is optimized under a multi-part loss that includes contrastive learning. The reported experiments show it outperforming prior approaches on the MSD and MMSD2 benchmarks.

Core claim

URMF injects visual evidence into textual representations via multi-head cross-attention, enhances incongruity reasoning with self-attention in the fused space, models textual, visual, and interaction-aware representations as learnable Gaussian posteriors for uncertainty estimation, and dynamically adjusts modality contributions based on this uncertainty to suppress unreliable evidence, achieving superior performance on the MSD and MMSD2 benchmarks through a unified optimization objective.

What carries the argument

Learnable Gaussian posteriors for textual, visual, and interaction-aware representations that enable uncertainty estimation and dynamic fusion adjustment.
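One way to picture uncertainty-driven fusion is a precision-weighted average of the modality means. The sketch below is a minimal illustration under stated assumptions: this page does not give the paper's exact weighting rule, so the softmax over negative log-variance, the one-scalar-per-modality summary, and the names `precision_weights` and `fuse` are all hypothetical.

```python
import math


def precision_weights(log_vars):
    """Turn one log-variance per modality into fusion weights that sum to 1.

    Assumption: each modality's uncertainty is summarised by a single scalar
    log-variance. A softmax over the negative log-variance is the same as
    normalising the precisions 1 / sigma^2.
    """
    shift = max(-lv for lv in log_vars)  # subtract the max for numerical stability
    exps = [math.exp(-lv - shift) for lv in log_vars]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(means, log_vars):
    """Precision-weighted average of the modality mean vectors."""
    weights = precision_weights(log_vars)
    dim = len(means[0])
    fused = [sum(w * mu[d] for w, mu in zip(weights, means)) for d in range(dim)]
    return fused, weights


text_mu, image_mu = [1.0, 0.0], [0.0, 1.0]
# Hypothetical case: the image posterior is noisier (variance 4.0) than the text (variance 1.0).
fused, weights = fuse([text_mu, image_mu], [math.log(1.0), math.log(4.0)])
print(weights)  # the text weight is larger than the image weight
print(fused)
```

With these toy inputs the noisier image contributes a quarter as much as the text, which is the qualitative behaviour the summary describes: higher estimated variance, smaller contribution.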

If this is right

Improved accuracy in detecting sarcasm by avoiding over-reliance on noisy modalities.
Enhanced robustness when applied to real-world social media content with inconsistent modality quality.
Potential for better incongruity reasoning by preserving cues from reliable modalities.
The unified objective helps align distributions and regularize the model effectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This uncertainty modeling approach may generalize to other multimodal tasks where modality reliability varies, such as sentiment analysis or visual question answering.
Future work could explore how these Gaussian estimates correlate with human judgments of modality trustworthiness.
Applying similar techniques in low-resource settings might reduce the need for high-quality paired data.

Load-bearing premise

The uncertainty values derived from the Gaussian posteriors truly correspond to the actual noise and reliability differences between text and images in sarcastic posts.

What would settle it

A test set where modalities have known noise levels added artificially, showing that the model's uncertainty estimates do not match the injected noise levels or that performance does not improve when using the uncertainty-based weighting.
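The check described here can be sketched as a small harness: add Gaussian noise of known strength to clean features, ask an uncertainty estimator for a score, and correlate the scores with the injected noise levels. Everything in the sketch is an assumption for illustration. `toy_uncertainty` is a stand-in for the model's variance output (it simply measures feature variance), and the noise levels and feature vector are invented, not taken from the paper.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def inject_noise(features, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each feature."""
    return [f + rng.gauss(0.0, sigma) for f in features]


def toy_uncertainty(features):
    """Hypothetical stand-in for a learned variance head: population variance."""
    mean = sum(features) / len(features)
    return sum((f - mean) ** 2 for f in features) / len(features)


def noise_uncertainty_correlation(clean, sigmas, estimate, seed=0):
    """Correlate injected noise levels with the estimator's uncertainty scores."""
    rng = random.Random(seed)
    estimates = [estimate(inject_noise(clean, s, rng)) for s in sigmas]
    return pearson(sigmas, estimates), estimates


clean = [0.0] * 200              # invented clean feature vector
sigmas = [0.0, 0.5, 1.0, 2.0]    # invented noise levels
r, estimates = noise_uncertainty_correlation(clean, sigmas, toy_uncertainty)
print(round(r, 3))
```

In a real evaluation the `estimate` argument would be the trained model's predicted variance for the corrupted modality; a correlation near zero would count against the load-bearing premise.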

Figures

Figures reproduced from arXiv: 2604.06728 by Guoying Zhang, Junjie Mou, Weichen Cheng, Weijia Li, Zhenyu Wang, Zongyou Zhao.

**Figure 1.** The overall framework of URMF.

**Figure 2.** t-SNE visualization of the joint latent representations of major ablation

Original abstract

Multimodal sarcasm detection (MSD) aims to identify sarcastic intent from semantic incongruity between text and image. Although recent methods have improved MSD through cross-modal interaction and incongruity reasoning, most still treat modalities as equally reliable. In real social media posts, however, text and images often differ in noise level and relevance, making deterministic fusion susceptible to noisy evidence and weakened incongruity cues. To address this issue, we propose Uncertainty-aware Robust Multimodal Fusion (URMF), a unified framework for robust MSD. URMF first injects visual evidence into textual representations through multi-head cross-attention, and then applies self-attention in the fused semantic space to enhance incongruity reasoning. It models textual, visual, and interaction-aware representations as learnable Gaussian posteriors to estimate modality-specific uncertainty. Based on the estimated uncertainty, URMF dynamically adjusts modality contributions during fusion to suppress unreliable evidence. We further optimize the model with a unified objective that combines information bottleneck regularization, modality prior regularization, cross-modal distribution alignment, and uncertainty-driven contrastive learning. Experiments on the public MSD and MMSD2 benchmarks show that URMF outperforms representative unimodal, multimodal, and MLLM-based baselines. The results demonstrate that explicit uncertainty modeling can improve both accuracy and robustness in multimodal sarcasm detection.
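The first step in the abstract, injecting visual evidence into textual representations through cross-attention, can be sketched with text tokens as queries and image patches as keys and values. This is a simplified illustration, not the paper's implementation: it uses a single head, omits the learned projection matrices, and assumes a residual add as the injection step. The function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_tokens, image_patches):
    """Single-head scaled dot-product attention: text queries, image keys/values."""
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, all_weights = [], []
    for q in text_tokens:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in image_patches]
        weights = softmax(scores)
        attended = [
            sum(w * v[d] for w, v in zip(weights, image_patches)) for d in range(dim)
        ]
        # Residual add (an assumption): the attended visual summary is added to the text token.
        outputs.append([qi + ai for qi, ai in zip(q, attended)])
        all_weights.append(weights)
    return outputs, all_weights


text = [[1.0, 0.0], [0.0, 1.0]]                  # two toy text tokens
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy image patches
fused_text, attn = cross_attention(text, patches)
print(attn[0])
```

Each text token ends up with the same dimensionality as before, now carrying a weighted summary of the image patches it attends to most.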

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

URMF adds Gaussian posteriors and a four-term loss to multimodal sarcasm detection for dynamic fusion, with reported gains on two benchmarks, but the uncertainties lack direct validation against actual noise.

The paper's main move is to model textual, visual, and fused representations as learnable Gaussian posteriors so the fusion step can down-weight unreliable modalities during sarcasm detection. It uses multi-head cross-attention to inject visual evidence into text, self-attention in the combined space for incongruity, and a single objective that mixes information-bottleneck regularization, modality priors, cross-modal alignment, and uncertainty-driven contrastive terms. Experiments claim it beats unimodal, multimodal, and some MLLM baselines on the public MSD and MMSD2 sets.

Referee Report

3 major / 2 minor

Summary. The paper proposes Uncertainty-aware Robust Multimodal Fusion (URMF) for multimodal sarcasm detection. It injects visual evidence into text via multi-head cross-attention, applies self-attention for incongruity reasoning, models textual/visual/interaction-aware representations as learnable Gaussian posteriors to estimate modality-specific uncertainty, and dynamically adjusts fusion weights to suppress unreliable evidence. Optimization uses a unified objective combining information bottleneck regularization, modality prior regularization, cross-modal distribution alignment, and uncertainty-driven contrastive learning. Experiments on the public MSD and MMSD2 benchmarks report outperformance over unimodal, multimodal, and MLLM baselines.

Significance. If the learned Gaussian uncertainties meaningfully track real modality noise and reliability (rather than serving as auxiliary weights), the framework could advance robust multimodal fusion by better preserving incongruity signals under noisy social-media conditions. The unified objective and explicit Gaussian posterior modeling provide a coherent way to incorporate uncertainty, with potential transfer to other multimodal tasks involving variable evidence quality.

major comments (3)

[Method (Gaussian posterior modeling and uncertainty-driven fusion)] The central claim that dynamic adjustment based on estimated uncertainty suppresses unreliable evidence while preserving incongruity cues rests on the assumption that the learnable Gaussian variances reflect actual modality reliability. No correlation analysis, synthetic-noise injection experiments, or oracle-reliability comparison is described to validate this; without such evidence the variances may function simply as learned weighting parameters.
[Objective function and optimization] The unified objective (information bottleneck + modality prior + cross-modal alignment + uncertainty-driven contrastive loss) is presented without an explicit derivation or ablation showing how each term constrains the variance parameters to correlate with observable noise rather than target-label prediction. This leaves open whether the reported gains on MSD/MMSD2 are attributable to uncertainty awareness or to the additional capacity of the Gaussian parameterization.
[Experiments] Table or figure reporting benchmark results: the abstract states outperformance but provides no numerical margins, standard deviations across runs, or statistical significance tests against the strongest MLLM baselines, making it impossible to judge whether the robustness improvements are practically meaningful.

minor comments (2)

[Method] Notation for the Gaussian parameters (means and variances) should be introduced with explicit equations rather than descriptive text to allow readers to verify the information-bottleneck and alignment terms.
[Abstract] The abstract claims robustness gains but does not define the robustness evaluation protocol (e.g., noise injection levels or out-of-distribution splits) used to support that claim.
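As an illustration of the explicit notation the first minor comment asks for, a diagonal Gaussian posterior per modality and its closed-form KL divergence to a standard normal prior could be written as below. The symbols are assumptions, since this page does not reproduce the paper's own notation: $m$ indexes the textual, visual, and interaction-aware representations, $D$ is the latent dimension, and the standard normal prior is the usual choice in variational information bottleneck work rather than a detail confirmed by the source.

```latex
% Assumed notation; not taken from the paper.
\begin{align}
q_\phi(z_m \mid x_m) &= \mathcal{N}\!\left(\mu_m,\ \operatorname{diag}(\sigma_m^2)\right),
  \qquad m \in \{t, v, i\} \\
z_m &= \mu_m + \sigma_m \odot \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I) \\
\mathrm{KL}\!\left(q_\phi(z_m \mid x_m)\,\big\|\,\mathcal{N}(0, I)\right)
  &= \frac{1}{2} \sum_{d=1}^{D}
     \left(\mu_{m,d}^2 + \sigma_{m,d}^2 - \log \sigma_{m,d}^2 - 1\right)
\end{align}
```

Written this way, a reader can see directly that the KL term is zero only when $\mu_m = 0$ and $\sigma_m^2 = 1$, so any such regularizer pulls the variances toward the prior regardless of how noisy the input is.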

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the validation and presentation of our work.

Referee: [Method (Gaussian posterior modeling and uncertainty-driven fusion)] The central claim that dynamic adjustment based on estimated uncertainty suppresses unreliable evidence while preserving incongruity cues rests on the assumption that the learnable Gaussian variances reflect actual modality reliability. No correlation analysis, synthetic-noise injection experiments, or oracle-reliability comparison is described to validate this; without such evidence the variances may function simply as learned weighting parameters.

Authors: We appreciate the referee's emphasis on validating that the learned variances capture genuine modality reliability rather than serving as generic weights. The Gaussian posteriors are optimized jointly with the uncertainty-driven contrastive loss and information bottleneck term, which are explicitly designed to penalize overconfident representations from noisy modalities and encourage higher variance for unreliable evidence. This is not arbitrary parameterization, as the fusion weights are derived directly from the posterior variances. However, we acknowledge that direct empirical validation (e.g., correlation with injected noise or oracle comparisons) is absent from the current manuscript. We will add a dedicated subsection with synthetic noise injection experiments on MSD/MMSD2, correlation plots between estimated variances and noise levels, and oracle-reliability baselines in the revised version. revision: yes
Referee: [Objective function and optimization] The unified objective (information bottleneck + modality prior + cross-modal alignment + uncertainty-driven contrastive loss) is presented without an explicit derivation or ablation showing how each term constrains the variance parameters to correlate with observable noise rather than target-label prediction. This leaves open whether the reported gains on MSD/MMSD2 are attributable to uncertainty awareness or to the additional capacity of the Gaussian parameterization.

Authors: We thank the referee for this observation. The objective combines the terms to regularize the posteriors such that the information bottleneck and modality prior promote compact yet informative distributions, while the contrastive loss aligns low-uncertainty representations across modalities. Nevertheless, the manuscript does not include an explicit derivation or component-wise ablations isolating effects on the variance parameters. In the revision, we will add a derivation in the appendix explaining the gradient flow on the variance terms and report ablation results (removing each loss term individually) showing impacts on both uncertainty estimation quality and final detection accuracy. This will help distinguish the contribution of uncertainty awareness from parameterization capacity. revision: yes
Referee: [Experiments] Table or figure reporting benchmark results: the abstract states outperformance but provides no numerical margins, standard deviations across runs, or statistical significance tests against the strongest MLLM baselines, making it impossible to judge whether the robustness improvements are practically meaningful.

Authors: We agree that detailed quantitative reporting is necessary for assessing practical significance. While the full experiments section contains benchmark tables, we will revise them to explicitly report numerical performance margins over all baselines (including the strongest MLLM methods), mean and standard deviation across multiple random seeds, and statistical significance (e.g., paired t-test p-values) against the top-performing baselines. These enhancements will also be reflected in an updated abstract and a new summary table for clarity. revision: yes
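The reporting promised in this response, mean and standard deviation across seeds plus a paired t-test, can be sketched with the standard library. The per-seed accuracies below are placeholders invented for illustration; they are not results from the paper. The sketch stops at the t statistic and degrees of freedom, since the p-value needs a t-distribution table or a library routine such as SciPy's `ttest_rel`.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched per-seed scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (std_diff / math.sqrt(n))
    return t_stat, n - 1


# Placeholder accuracies over five seeds, invented for illustration only.
proposed = [0.80, 0.82, 0.81, 0.83, 0.79]
baseline = [0.78, 0.80, 0.80, 0.80, 0.78]

t_stat, dof = paired_t(proposed, baseline)
print(f"proposed: {statistics.mean(proposed):.3f} ± {statistics.stdev(proposed):.3f}")
print(f"baseline: {statistics.mean(baseline):.3f} ± {statistics.stdev(baseline):.3f}")
print(f"t = {t_stat:.2f}, dof = {dof}")
```

Pairing by seed matters here: both models should share the same seeds and data splits so that the differences, not the raw scores, carry the comparison.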

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines a multimodal fusion architecture that parameterizes textual, visual, and fused representations as learnable Gaussian posteriors, optimizes them end-to-end via a composite loss (information bottleneck, modality prior, alignment, and uncertainty-driven contrastive terms) on the public MSD and MMSD2 training splits, and reports accuracy/robustness gains on the corresponding held-out test sets against external baselines. No equation or step reduces a claimed result to its own inputs by construction; the uncertainty parameters are ordinary trainable variables whose values are determined by gradient descent on labeled data, and the performance metric is measured on independent test examples. No self-citations, imported uniqueness theorems, or ansatzes are invoked as load-bearing premises. The derivation is therefore self-contained against the stated external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Review performed on abstract only; full derivation, parameter counts, and experimental controls are unavailable, so the ledger is necessarily incomplete and conservative.

free parameters (1)

learnable Gaussian posterior parameters
Means and variances for textual, visual, and interaction-aware representations are learned from data to estimate uncertainty.
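To make concrete what "learned means and variances" amounts to, the sketch below draws samples from a diagonal Gaussian using the reparameterization trick, with the variance stored as a log-variance so it stays positive. This is a generic illustration, not the paper's code: the parameter values are invented, and in a trained model `mu` and `log_var` would be outputs of an encoder rather than constants.

```python
import math
import random


def sample_posterior(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


rng = random.Random(0)
mu = [1.0, -2.0]                            # invented posterior means
log_var = [math.log(0.25), math.log(4.0)]   # invented: std 0.5 and std 2.0

draws = [sample_posterior(mu, log_var, rng) for _ in range(20000)]
emp_mean = [sum(d[i] for d in draws) / len(draws) for i in range(2)]
emp_std = [
    math.sqrt(sum((d[i] - emp_mean[i]) ** 2 for d in draws) / len(draws))
    for i in range(2)
]
print([round(v, 2) for v in emp_mean])
print([round(v, 2) for v in emp_std])
```

The empirical mean and spread of the draws approximate the chosen parameters, which is what lets gradients flow through `mu` and `log_var` while the randomness stays in `eps`.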

axioms (1)

domain assumption: Sarcastic intent manifests as detectable semantic incongruity between text and image modalities.
Stated as the core goal of MSD in the abstract.

invented entities (1)

learnable Gaussian posteriors for modality representations (no independent evidence)
purpose: To quantify and propagate modality-specific uncertainty during fusion.
Introduced in the abstract as the mechanism for dynamic weighting; no external falsifiable prediction supplied.

pith-pipeline@v0.9.0 · 5550 in / 1483 out tokens · 54014 ms · 2026-05-10T19:04:29.974436+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 2 internal anchors

[1] Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational information bottleneck. arXiv preprint arXiv:1612.00410 (2016)
[2] Cai, Y., Cai, H., Wan, X.: Multi-modal sarcasm detection in twitter with hierarchical fusion model. In: Proceedings of the 57th annual meeting of the association for computational linguistics. pp. 2506–2515 (2019)
[3] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). pp. 4171–4186 (2019)
[4] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
[5] Fang, Y., Zhang, L., Wang, S., Zhang, W., Wang, Y., Liu, Y., Wei, X., Yang, X.: Dcpnet: a comprehensive framework for multimodal sarcasm detection via graph topology extraction and multi-scale feature fusion. Frontiers of Computer Science 20(7), 2007336 (2026)
[6] Gao, Z., Jiang, X., Xu, X., Shen, F., Li, Y., Shen, H.T.: Embracing unimodal aleatoric uncertainty for robust multimodal fusion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26876–26885 (2024)
[7] Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural networks 18(5-6), 602–610 (2005)
[8] Guo, D., Cao, C., Yuan, F., Liu, Y., Zeng, G., Yu, X., Peng, H., Yu, P.S.: Multi-view incongruity learning for multimodal sarcasm detection. In: Proceedings of the 31st International Conference on Computational Linguistics. pp. 1754–1766 (2025)
[9] Jia, M., Xie, C., Jing, L.: Debiasing multimodal sarcasm detection with contrastive learning. In: Proceedings of the AAAI conference on artificial intelligence. vol. 38, pp. 18354–18362 (2024)
[10] Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems 30 (2017)
[11] Li, K., Chen, Y., Wu, Q., Mai, W., Li, F., Xue, Y.: Ambiguity-aware multi-level incongruity fusion network for multi-modal sarcasm detection. In: Proceedings of the 31st International Conference on Computational Linguistics. pp. 380–391 (2025)
[12] Liang, B., Lou, C., Li, X., Gui, L., Yang, M., Xu, R.: Multi-modal sarcasm detection with interactive in-modal and cross-modal graphs. In: Proceedings of the 29th ACM international conference on multimedia. pp. 4707–4715 (2021)
[13] Liang, B., Lou, C., Li, X., Yang, M., Gui, L., He, Y., Pei, W., Xu, R.: Multi-modal sarcasm detection via cross-modal graph convolutional network. In: Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers). pp. 1767–1777 (2022)
[14] Liu, H., Wang, W., Li, H.: Towards multi-modal sarcasm detection via hierarchical congruity modeling with knowledge enhancement. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. pp. 4995–5006 (2022)
[15] Lu, Q., Long, Y., Sun, X., Feng, J., Zhang, H.: Fact-sentiment incongruity combination network for multimodal sarcasm detection. Information Fusion 104, 102203 (2024)
[16] Pan, H., Lin, Z., Fu, P., Qi, Y., Wang, W.: Modeling intra and inter-modality incongruity for multi-modal sarcasm detection. In: Findings of the Association for Computational Linguistics: EMNLP 2020. pp. 1383–1392 (2020)
[17] Qiao, Y., Jing, L., Song, X., Chen, X., Zhu, L., Nie, L.: Mutual-enhanced incongruity learning network for multi-modal sarcasm detection. In: Proceedings of the AAAI conference on artificial intelligence. vol. 37, pp. 9507–9515 (2023)
[18] Qin, L., Huang, S., Chen, Q., Cai, C., Zhang, Y., Liang, B., Che, W., Xu, R.: Mmsd2.0: Towards a reliable multi-modal sarcasm detection system. In: Findings of the association for computational linguistics: ACL 2023. pp. 10834–10845 (2023)
[19] Tang, B., Lin, B., Yan, H., Li, S.: Leveraging generative large language models with visual instruction and demonstration retrieval for multimodal sarcasm detection. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 1732–1742 (2024)
[20] Tay, Y., Tuan, L.A., Hui, S.C., Su, J.: Reasoning with sarcasm by reading in-between. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1010–1020 (2018)
[21] Tian, Y., Xu, N., Zhang, R., Mao, W.: Dynamic routing transformer network for multimodal sarcasm detection. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2468–2480 (2023)
[22] Wang, J., Yang, Y., Jiang, Y., Ma, M., Xie, Z., Li, T.: Cross-modal incongruity aligning and collaborating for multi-modal sarcasm detection. Information Fusion 103, 102132 (2024)
[23] Wang, T., Li, J., Su, G., Zhang, Y., Su, D., Hu, Y., Sha, Y.: Rclmufn: Relational context learning and multiplex fusion network for multimodal sarcasm detection. Knowledge-Based Systems 319, 113614 (2025)
[24] Wang, X., Sun, X., Yang, T., Wang, H.: Building a bridge: a method for image-text sarcasm detection without pretraining on image-text data. In: Proceedings of the first international workshop on natural language processing beyond text. pp. 19–29 (2020)
[25] Wei, Y., Duan, M., Zhou, H., Jia, Z., Gao, Z., Wang, L.: Towards multimodal sarcasm detection via label-aware graph contrastive learning with back-translation augmentation. Knowledge-Based Systems 300, 112109 (2024)
[26] Wei, Y., Yuan, S., Zhou, H., Wang, L., Yan, Z., Yang, R., Chen, M.: G^2sam: graph-based global semantic awareness method for multimodal sarcasm detection. In: Proceedings of the AAAI conference on artificial intelligence. vol. 38, pp. 9151–9159 (2024)
[27] Wei, Y., Zhou, H., Yuan, S., Chen, M., Shi, H., Jia, Z., Wang, L., He, X.: Deepmsd: Advancing multimodal sarcasm detection through knowledge-augmented graph reasoning. IEEE Transactions on Circuits and Systems for Video Technology 35(7), 6413–6423 (2025)
[28] Wen, C., Jia, G., Yang, J.: Dip: Dual incongruity perceiving network for sarcasm detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2540–2550 (2023)
[29] Xi, Z., Yu, B., Wang, H.: Multimodal sarcasm detection based on sentiment-clue inconsistency global detection fusion network. Expert Systems with Applications 275, 127020 (2025)
[30] Xiong, T., Zhang, P., Zhu, H., Yang, Y.: Sarcasm detection with self-matching networks and low-rank bilinear pooling. In: The world wide web conference. pp. 2115–2124 (2019)
[31] Xu, N., Zeng, Z., Mao, W.: Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association. In: Proceedings of the 58th annual meeting of the association for computational linguistics. pp. 3777–3786 (2020)
[32] Yuan, S., Wei, Y., Zhou, H., Xu, Q., Chen, M., He, X.: Enhancing semantic awareness by sentimental constraint with automatic outlier masking for multimodal sarcasm detection. IEEE Transactions on Multimedia (2025)
[33] Yue, T., Mao, R., Shi, X., Cambria, E.: Interarm: Interpretable affective reasoning model for multimodal sarcasm detection. IEEE Transactions on Affective Computing (2026)
[34] Yue, T., Mao, R., Wang, H., Hu, Z., Cambria, E.: Knowlenet: Knowledge fusion network for multimodal sarcasm detection. Information Fusion 100, 101921 (2023)
[35] Zhang, J., Chen, C.P., Li, S., Zhang, T.: Incongruity-aware tension field network for multi-modal sarcasm detection. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 14499–14508 (2025)
[36] Zhao, H., Kong, Y., Xu, Y., Gou, G., Xu, H., Wang, Y., Zhang, H.: Mmsd3.0: A multi-image benchmark for real-world multimodal sarcasm detection. arXiv preprint arXiv:2510.23299 (2025)
[37] Zhou, H., Yan, J., Chen, Y., Hong, R., Zuo, W., Jin, K.: Ldgnet: Llms debate-guided network for multimodal sarcasm detection. In: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2025)
[38] Zhou, J., Wu, Y., Zhang, Y., Zhang, Y., Liu, Y., Huang, B., Yuan, C.: Semirnet: A semantic irony recognition network for multimodal sarcasm detection. In: 2025 10th International Conference on Information and Network Technologies (ICINT). pp. 158–162. IEEE (2025)
[39] Zhuang, X., Zhou, F., Li, Z.: Multi-modal sarcasm detection via knowledge-aware focused graph convolutional networks. ACM Transactions on Multimedia Computing, Communications and Applications 21(5), 1–22 (2025)