
arxiv: 2511.06452 · v3 · submitted 2025-11-09 · 💻 cs.LG

MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains

Pith reviewed 2026-05-17 23:16 UTC · model grok-4.3

classification 💻 cs.LG
keywords multimodal fusion · benchmarking · evaluation pipeline · datasets · modalities · predictive tasks · domain adaptation · fusion methods

The pith

A new benchmark unites over 30 datasets across 15 modalities and 20 tasks to create standardized tests for multimodal fusion methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the problem that current multimodal fusion evaluations rely on too few datasets, which risks biased results and poor generalization to practical uses. It introduces a large-scale benchmark that pulls together diverse datasets, modalities, and tasks from multiple domains along with an automated pipeline for running standardized models and fusion approaches. Large-scale experiments on this platform set new baselines and make direct comparisons between methods possible. If successful, this setup would reduce overfitting to narrow test cases and support the search for more reliable fusion techniques.

Core claim

By integrating over 30 datasets that cover 15 modalities and 20 predictive tasks across key application domains, and by supplying an open-source unified automated evaluation pipeline with standardized implementations of state-of-the-art models and fusion paradigms, the work enables large-scale experiments that establish new performance baselines and supply the community with a platform for rigorous, reproducible assessment of multimodal models.

What carries the argument

The MULTIBENCH++ benchmark together with its automated evaluation pipeline, which standardizes datasets, tasks, models, and fusion methods for consistent testing.
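
To make the idea of a standardized pipeline concrete, here is a minimal sketch of an evaluation grid that scores every fusion method on every dataset with one shared metric. Everything in it is an illustrative assumption rather than the MULTIBENCH++ API: the names `early_fusion`, `late_fusion`, `run_grid`, and `TOY_DATASETS` are hypothetical, the data are made up, and the "models" are plain feature averages standing in for trained networks.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Features = Dict[str, List[float]]      # modality name -> feature vector
Sample = Tuple[Features, int]          # (features, binary label)
Fusion = Callable[[Features], float]   # maps features to a score


def early_fusion(feats: Features) -> float:
    """Concatenate all modality features first, then score the joint vector."""
    joined = [x for name in sorted(feats) for x in feats[name]]
    return sum(joined) / len(joined)


def late_fusion(feats: Features) -> float:
    """Score each modality separately, then average the per-modality scores."""
    scores = [sum(v) / len(v) for _, v in sorted(feats.items())]
    return sum(scores) / len(scores)


# Hypothetical registry of fusion paradigms (not taken from the paper).
FUSIONS: Dict[str, Fusion] = {"early": early_fusion, "late": late_fusion}

# Hypothetical toy datasets; real benchmarks would load data from disk.
TOY_DATASETS: Dict[str, List[Sample]] = {
    "toy_av": [
        ({"audio": [0.9, 0.8], "video": [0.7]}, 1),
        ({"audio": [0.1, 0.2], "video": [0.3]}, 0),
    ],
    "toy_it": [
        ({"image": [0.6, 0.9, 0.9], "text": [0.1]}, 1),
        ({"image": [0.1, 0.1, 0.1], "text": [0.8]}, 0),
    ],
}


def accuracy(fusion: Fusion, samples: List[Sample], threshold: float = 0.5) -> float:
    """Shared metric: fraction of samples whose thresholded score matches the label."""
    hits = sum(int(fusion(feats) > threshold) == label for feats, label in samples)
    return hits / len(samples)


def run_grid(datasets: Dict[str, List[Sample]],
             fusions: Dict[str, Fusion]) -> Dict[Tuple[str, str], float]:
    """Evaluate every (dataset, fusion) pair under the same metric."""
    return {
        (d, m): accuracy(fusions[m], datasets[d])
        for d, m in product(datasets, fusions)
    }


results = run_grid(TOY_DATASETS, FUSIONS)
for (dataset, method), acc in sorted(results.items()):
    print(f"{dataset:8s} {method:6s} accuracy={acc:.2f}")
```

On this toy data both rules agree on `toy_av`, while on `toy_it` late fusion falls to 0.5 accuracy because the single text feature receives the same weight as three image features. Differences of that kind only become visible when methods share datasets and a metric.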

If this is right

  • Fusion methods can be compared directly on a shared set of tasks instead of isolated datasets.
  • Models are less likely to overfit to the quirks of any single dataset and more likely to generalize.
  • New baselines provide concrete targets for future improvements in multimodal performance.
  • The automated pipeline makes it easier to reproduce results and test additional methods quickly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could highlight which fusion strategies succeed in specialized settings such as medical imaging or robotics.
  • Extending the platform with new modalities or tasks would require only minimal changes to the pipeline structure.
  • Researchers might adapt the evaluation code to study trade-offs between accuracy, speed, and data requirements across domains.

Load-bearing premise

The selected datasets and tasks capture the full complexity and variety of real-world multimodal problems without major biases or gaps that would distort evaluations.

What would settle it

Running the same top methods from this benchmark on a fresh collection of multimodal datasets drawn from domains not represented in the 30+ datasets and observing consistent drops in relative performance would indicate the benchmark's coverage is incomplete.
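
One way to quantify this test, offered as a sketch and not as the paper's protocol, is to compare how methods rank on the benchmark with how they rank on held-out datasets. The example below computes Spearman's rank correlation for untied scores; the method names and all numbers are invented for illustration.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank methods by score, 1 = best. Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {method: i + 1 for i, method in enumerate(ordered)}


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman rank correlation between two score tables over the same methods."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical mean scores of four fusion methods (made-up values).
in_benchmark = {"A": 0.82, "B": 0.79, "C": 0.75, "D": 0.70}
held_out = {"A": 0.61, "B": 0.66, "C": 0.58, "D": 0.64}

rho = spearman(in_benchmark, held_out)
print(f"rank agreement (Spearman rho) = {rho:.2f}")
```

A correlation near 1 would suggest the benchmark ordering transfers to unseen domains; values near 0 or below, as with these toy numbers, would point to incomplete coverage.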

Figures

Figures reproduced from arXiv: 2511.06452 by Changqing Zhang, Guangyu Wang, Kecheng Xue, Leyan Xue, Xiaohong Liu, Zongbo Han.

Figure 1: An overview of the MULTIBENCH++ framework, highlighting our core contributions. (Left) We introduce a broader …
Original abstract

Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that inadequately represents the complexity and diversity of real-world scenarios, potentially leading to biased evaluations. This issue presents a twofold challenge. On one hand, models may overfit to the biases of specific datasets, hindering their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparisons between different fusion methods difficult. Consequently, a truly universal and high-performance fusion model has yet to emerge. To address these challenges, we have developed a large-scale, domain-adaptive benchmark for multimodal evaluation. This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement this, we have also developed an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Leveraging this platform, we have conducted large-scale experiments, successfully establishing new performance baselines across multiple tasks. This work provides the academic community with a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to propel the field of multimodal artificial intelligence to new heights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces MULTIBENCH++, a large-scale domain-adaptive benchmark that integrates over 30 datasets spanning 15 modalities and 20 predictive tasks across key application domains, together with an open-source unified automated evaluation pipeline containing standardized implementations of state-of-the-art models and fusion paradigms; the authors report large-scale experiments that establish new performance baselines.

Significance. If the coverage and curation claims are substantiated, the benchmark and pipeline would supply the community with a much-needed unified platform for reproducible, cross-method comparisons of multimodal fusion techniques, reducing the risk of dataset-specific overfitting and supporting more reliable progress in multimodal AI.

major comments (1)
  1. [Abstract] The central claim that the collection of over 30 datasets 'adequately represents the complexity and diversity of real-world scenarios' is load-bearing for the 'domain-adaptive' and 'unified' properties, yet the manuscript supplies no explicit taxonomy of multimodal scenarios, quantitative coverage metrics, or justification that the chosen datasets avoid systematic under-representation of modality combinations or application regimes.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly defined the 15 modalities and the 20 predictive tasks rather than leaving them as aggregate counts.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the benchmark's potential impact. We address the single major comment below and will revise the manuscript to strengthen the supporting arguments for our claims.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the collection of over 30 datasets 'adequately represents the complexity and diversity of real-world scenarios' is load-bearing for the 'domain-adaptive' and 'unified' properties, yet the manuscript supplies no explicit taxonomy of multimodal scenarios, quantitative coverage metrics, or justification that the chosen datasets avoid systematic under-representation of modality combinations or application regimes.

    Authors: We agree that the abstract's phrasing makes a strong claim that requires clearer substantiation to fully support the 'domain-adaptive' and 'unified' framing. The manuscript describes the curation process, including coverage of 15 modalities and 20 tasks drawn from domains such as healthcare, robotics, and multimedia, with rationale for dataset selection to promote diversity. However, it does not include an explicit taxonomy of multimodal scenarios or quantitative coverage metrics (e.g., modality-pair frequency or regime representation scores). We will revise the abstract to moderate the claim and add a new subsection (or appendix) in the main text that provides (1) a taxonomy organizing scenarios by modality combinations, task types, and application regimes, (2) quantitative metrics on coverage, and (3) explicit discussion of potential gaps and how the selected datasets mitigate systematic under-representation. These additions will be supported by tables and analysis of the 30+ datasets. revision: yes
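
The modality-pair frequency mentioned in the rebuttal could be computed along these lines. This is a sketch under stated assumptions: the `REGISTRY` mapping, its dataset names, and the `pair_coverage` ratio are hypothetical, and the paper does not specify how such a metric would be defined.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set, Tuple

# Hypothetical registry: dataset name -> modalities it contains.
REGISTRY: Dict[str, Set[str]] = {
    "ds_a": {"image", "text"},
    "ds_b": {"audio", "text", "video"},
    "ds_c": {"image", "text"},
    "ds_d": {"hyperspectral", "lidar"},
}


def pair_frequency(registry: Dict[str, Set[str]]) -> Counter:
    """Count how many datasets contain each unordered modality pair."""
    counts: Counter = Counter()
    for modalities in registry.values():
        counts.update(combinations(sorted(modalities), 2))
    return counts


def pair_coverage(registry: Dict[str, Set[str]]) -> float:
    """Share of all possible modality pairs that appear in at least one dataset."""
    all_modalities = set().union(*registry.values())
    n = len(all_modalities)
    possible = n * (n - 1) // 2
    return len(pair_frequency(registry)) / possible


freq = pair_frequency(REGISTRY)
for pair, count in freq.most_common():
    print(pair, count)
print(f"pair coverage = {pair_coverage(REGISTRY):.2f}")
```

Pairs that never occur, such as audio with lidar in this toy registry, are the kind of gap a coverage table would make explicit.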

Circularity Check

0 steps flagged

No circularity: benchmark is a constructive artifact

Full rationale

The paper presents the creation of MULTIBENCH++ as the integration of over 30 existing public datasets spanning 15 modalities and 20 tasks, plus an open-source evaluation pipeline with standardized model implementations. This is a curation and standardization effort rather than a derivation chain involving equations, fitted parameters, or predictions that reduce to the inputs by construction. No self-definitional loops, fitted-input-as-prediction, or load-bearing self-citations appear in the abstract or described claims; the coverage assertions rest on explicit dataset selection rather than any mathematical equivalence to prior results. The work is self-contained as an empirical benchmarking platform.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the domain assumption that the curated collection of datasets provides representative coverage; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: The integrated datasets and tasks represent the complexity and diversity of real-world multimodal scenarios.
    Invoked when claiming the benchmark addresses biased evaluations from limited public datasets.



