MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains

Changqing Zhang; Guangyu Wang; Kecheng Xue; Leyan Xue; Xiaohong Liu; Zongbo Han

arxiv: 2511.06452 · v3 · submitted 2025-11-09 · 💻 cs.LG

Pith reviewed 2026-05-17 23:16 UTC · model grok-4.3

classification 💻 cs.LG

keywords: multimodal fusion, benchmarking, evaluation pipeline, datasets, modalities, predictive tasks, domain adaptation, fusion methods

The pith

A new benchmark unites over 30 datasets across 15 modalities and 20 tasks to create standardized tests for multimodal fusion methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the problem that current multimodal fusion evaluations rely on too few datasets, which risks biased results and poor generalization to practical uses. It introduces a large-scale benchmark that pulls together diverse datasets, modalities, and tasks from multiple domains along with an automated pipeline for running standardized models and fusion approaches. Large-scale experiments on this platform set new baselines and make direct comparisons between methods possible. If successful, this setup would reduce overfitting to narrow test cases and support the search for more reliable fusion techniques.

Core claim

By integrating over 30 datasets that cover 15 modalities and 20 predictive tasks across key application domains, and by supplying an open-source unified automated evaluation pipeline with standardized implementations of state-of-the-art models and fusion paradigms, the work enables large-scale experiments that establish new performance baselines and supply the community with a platform for rigorous, reproducible assessment of multimodal models.

What carries the argument

The MULTIBENCH++ benchmark together with its automated evaluation pipeline, which standardizes datasets, tasks, models, and fusion methods for consistent testing.

If this is right

Fusion methods can be compared directly on a shared set of tasks instead of isolated datasets.
Models are less likely to overfit to the quirks of any single dataset and more likely to generalize.
New baselines provide concrete targets for future improvements in multimodal performance.
The automated pipeline makes it easier to reproduce results and test additional methods quickly.
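
As a rough illustration of what comparison on a shared set of tasks means in practice, the sketch below runs every fusion method on every dataset through one scoring function. It is not the MULTIBENCH++ pipeline: the dataset names, the two toy fusion functions, and `toy_accuracy` are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from statistics import mean


def early_fusion(features_by_modality):
    """Concatenate per-modality feature vectors (modalities in sorted order)."""
    fused = []
    for name in sorted(features_by_modality):
        fused.extend(features_by_modality[name])
    return fused


def mean_fusion(features_by_modality):
    """Average equal-length per-modality vectors element-wise."""
    vectors = list(features_by_modality.values())
    return [mean(values) for values in zip(*vectors)]


def toy_accuracy(fuse, samples):
    """Hypothetical scorer: predict 1 when the fused vector sums to a positive number."""
    correct = sum(int(sum(fuse(feats)) > 0) == label for feats, label in samples)
    return correct / len(samples)


def evaluate_grid(datasets, fusion_methods, score_fn):
    """Score every fusion method on every dataset with one shared function."""
    return {
        (dataset_name, method_name): score_fn(fuse, samples)
        for dataset_name, samples in datasets.items()
        for method_name, fuse in fusion_methods.items()
    }


# Hypothetical toy data: (features per modality, binary label).
datasets = {
    "dataset_a": [
        ({"audio": [0.5, -0.1], "text": [0.2, 0.3]}, 1),
        ({"audio": [-0.4, -0.2], "text": [0.1, -0.3]}, 0),
    ],
    "dataset_b": [
        ({"image": [1.0, -2.0], "text": [0.5, 0.1]}, 0),
        ({"image": [0.3, 0.3], "text": [-0.1, 0.2]}, 1),
    ],
}
fusion_methods = {"early": early_fusion, "mean": mean_fusion}

results = evaluate_grid(datasets, fusion_methods, toy_accuracy)
for (dataset_name, method_name), score in sorted(results.items()):
    print(dataset_name, method_name, score)
```

Because every cell of the grid comes from the same `score_fn`, differences between cells reflect the method or the dataset rather than differences in evaluation code.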

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could highlight which fusion strategies succeed in specialized settings such as medical imaging or robotics.
Extending the platform with new modalities or tasks would require only minimal changes to the pipeline structure.
Researchers might adapt the evaluation code to study trade-offs between accuracy, speed, and data requirements across domains.

Load-bearing premise

The selected datasets and tasks capture the full complexity and variety of real-world multimodal problems without major biases or gaps that would distort evaluations.

What would settle it

Running the same top methods from this benchmark on a fresh collection of multimodal datasets drawn from domains not represented in the 30+ datasets and observing consistent drops in relative performance would indicate the benchmark's coverage is incomplete.
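
One way to make "relative performance" concrete is to compare how the same methods rank inside the benchmark and on held-out datasets. The sketch below computes a Spearman rank correlation between the two rankings. The method names and scores are hypothetical, and this is an illustrative check rather than a protocol described in the paper.

```python
def rank(scores):
    """Map each method to its rank (1 = highest score). Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {method: position + 1 for position, method in enumerate(ordered)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation for two score dicts over the same methods (no ties)."""
    ranks_a, ranks_b = rank(scores_a), rank(scores_b)
    n = len(ranks_a)
    squared_diffs = sum((ranks_a[m] - ranks_b[m]) ** 2 for m in ranks_a)
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Hypothetical mean scores for four methods; not taken from the paper.
in_benchmark = {"method_a": 0.81, "method_b": 0.78, "method_c": 0.74, "method_d": 0.70}
held_out = {"method_a": 0.62, "method_b": 0.69, "method_c": 0.66, "method_d": 0.60}

rho = spearman(in_benchmark, held_out)
print(f"rank correlation between benchmark and held-out rankings: {rho:.2f}")
```

A correlation near 1 would mean the benchmark ranking carries over to the new domains; a low or negative value would point to the coverage gap described above.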

Figures

Figures reproduced from arXiv: 2511.06452 by Changqing Zhang, Guangyu Wang, Kecheng Xue, Leyan Xue, Xiaohong Liu, Zongbo Han.

**Figure 1.** An overview of the MULTIBENCH++ framework, highlighting our core contributions. (Left) We introduce a broader ...

Original abstract

Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that inadequately represents the complexity and diversity of real-world scenarios, potentially leading to biased evaluations. This issue presents a twofold challenge. On one hand, models may overfit to the biases of specific datasets, hindering their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparisons between different fusion methods difficult. Consequently, a truly universal and high-performance fusion model has yet to emerge. To address these challenges, we have developed a large-scale, domain-adaptive benchmark for multimodal evaluation. This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement this, we have also developed an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Leveraging this platform, we have conducted large-scale experiments, successfully establishing new performance baselines across multiple tasks. This work provides the academic community with a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to propel the field of multimodal artificial intelligence to new heights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MULTIBENCH++ expands dataset count and adds an automated pipeline but leaves the coverage justification for real-world diversity unshown.

The main point is that this paper builds a larger multimodal fusion benchmark than prior efforts like MultiBench, collecting over 30 datasets across 15 modalities and 20 tasks plus an open-source automated evaluation pipeline. That is the concrete new artifact they deliver, and they use it to run experiments that set fresh baselines on multiple tasks. The pipeline with standardized model implementations is a practical step that could make comparisons easier and more consistent across groups. It directly targets the real problem of narrow dataset testing that leads to overfitting and incomparable results. The scale and tooling are useful additions for anyone trying to move beyond single-dataset evaluations in applied areas.

The soft spot sits in the dataset curation. The abstract and stress-test note both point to the same gap: no explicit taxonomy, quantitative coverage check, or justification that the chosen collection avoids systematic under-representation of modality mixes or application regimes. Without that mapping, the claim that the benchmark is domain-adaptive and representative rests on an assumption rather than demonstrated evidence. If the full paper supplies a clear selection rationale and diversity analysis, the concern shrinks; if not, it stays a load-bearing weakness for the unified framing.

This is the kind of work that multimodal fusion researchers and people building models for specialized domains would actually use. It gives them a shared platform and baselines to test against. The thinking is straightforward and honest about the evaluation bottleneck in the field. I would bring it to a reading group to walk through the exact dataset list and see how the experiments were run. It deserves peer review because the artifact is substantial enough to be worth referee time even if the coverage argument needs more detail in revision.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces MULTIBENCH++, a large-scale domain-adaptive benchmark that integrates over 30 datasets spanning 15 modalities and 20 predictive tasks across key application domains, together with an open-source unified automated evaluation pipeline containing standardized implementations of state-of-the-art models and fusion paradigms; the authors report large-scale experiments that establish new performance baselines.

Significance. If the coverage and curation claims are substantiated, the benchmark and pipeline would supply the community with a much-needed unified platform for reproducible, cross-method comparisons of multimodal fusion techniques, reducing the risk of dataset-specific overfitting and supporting more reliable progress in multimodal AI.

major comments (1)

[Abstract] The central claim, implied rather than stated verbatim, that the collection of over 30 datasets adequately represents the complexity and diversity of real-world scenarios is load-bearing for the 'domain-adaptive' and 'unified' properties, yet the manuscript supplies no explicit taxonomy of multimodal scenarios, quantitative coverage metrics, or justification that the chosen datasets avoid systematic under-representation of modality combinations or application regimes.

minor comments (1)

[Abstract] The abstract would be clearer if it briefly defined the 15 modalities and the 20 predictive tasks rather than leaving them as aggregate counts.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the benchmark's potential impact. We address the single major comment below and will revise the manuscript to strengthen the supporting arguments for our claims.

Referee: [Abstract] The central claim, implied rather than stated verbatim, that the collection of over 30 datasets adequately represents the complexity and diversity of real-world scenarios is load-bearing for the 'domain-adaptive' and 'unified' properties, yet the manuscript supplies no explicit taxonomy of multimodal scenarios, quantitative coverage metrics, or justification that the chosen datasets avoid systematic under-representation of modality combinations or application regimes.

Authors: We agree that the abstract's phrasing makes a strong claim that requires clearer substantiation to fully support the 'domain-adaptive' and 'unified' framing. The manuscript describes the curation process, including coverage of 15 modalities and 20 tasks drawn from domains such as healthcare, robotics, and multimedia, with rationale for dataset selection to promote diversity. However, it does not include an explicit taxonomy of multimodal scenarios or quantitative coverage metrics (e.g., modality-pair frequency or regime representation scores). We will revise the abstract to moderate the claim and add a new subsection (or appendix) in the main text that provides (1) a taxonomy organizing scenarios by modality combinations, task types, and application regimes, (2) quantitative metrics on coverage, and (3) explicit discussion of potential gaps and how the selected datasets mitigate systematic under-representation. These additions will be supported by tables and analysis of the 30+ datasets.

Revision: yes
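
A coverage metric of the kind mentioned in the rebuttal, modality-pair frequency, could look roughly like the sketch below. It counts how many datasets contain each pair of modalities and lists the pairs that no dataset covers. The catalog is a hypothetical placeholder, not the actual MULTIBENCH++ dataset list, and the function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def modality_pair_frequency(dataset_modalities):
    """Count, for each unordered modality pair, how many datasets include both."""
    counts = Counter()
    for modalities in dataset_modalities.values():
        for pair in combinations(sorted(set(modalities)), 2):
            counts[pair] += 1
    return counts


def uncovered_pairs(dataset_modalities):
    """List modality pairs that never co-occur in any dataset."""
    all_modalities = sorted({m for ms in dataset_modalities.values() for m in ms})
    seen = modality_pair_frequency(dataset_modalities)
    return [pair for pair in combinations(all_modalities, 2) if pair not in seen]


# Hypothetical catalog: dataset name -> modalities it provides.
catalog = {
    "dataset_a": ["image", "text"],
    "dataset_b": ["audio", "text", "video"],
    "dataset_c": ["image", "text"],
    "dataset_d": ["image", "tabular"],
}

freq = modality_pair_frequency(catalog)
gaps = uncovered_pairs(catalog)

for pair, count in freq.most_common():
    print(pair, count)
print("pairs with no dataset:", gaps)
```

A table built from `freq` would show which modality combinations dominate the collection, and `gaps` gives a starting point for the discussion of under-represented combinations.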

Circularity Check

0 steps flagged

No circularity: benchmark is a constructive artifact

The paper presents the creation of MULTIBENCH++ as the integration of over 30 existing public datasets spanning 15 modalities and 20 tasks, plus an open-source evaluation pipeline with standardized model implementations. This is a curation and standardization effort rather than a derivation chain involving equations, fitted parameters, or predictions that reduce to the inputs by construction. No self-definitional loops, fitted-input-as-prediction, or load-bearing self-citations appear in the abstract or described claims; the coverage assertions rest on explicit dataset selection rather than any mathematical equivalence to prior results. The work is self-contained as an empirical benchmarking platform.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the domain assumption that the curated collection of datasets provides representative coverage; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: The integrated datasets and tasks represent the complexity and diversity of real-world multimodal scenarios. Invoked when claiming the benchmark addresses biased evaluations from limited public datasets.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited theorem: unclear
Paper passage: "This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited theorem: unclear
Paper passage: "We evaluate several Transformer-based architectures for multimodal feature fusion... Hierarchical Attention (Multi-to-One), Cross-Attention Fusion (CAF)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 3 internal anchors

[1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page
[3]

Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; and Koyama, M. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2623--2631

work page 2019
[4]

Alain, G.; and Bengio, Y. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644

work page internal anchor Pith review Pith/arXiv arXiv 2016
[5]

Anumula, J.; Neil, D.; Delbruck, T.; and Liu, S.-C. 2018. Feature representations for neuromorphic audio spike streams. Frontiers in neuroscience, 12: 23

work page 2018
[6]

K.; Hossain, M

Atrey, P. K.; Hossain, M. A.; El Saddik, A.; and Kankanhalli, M. S. 2010. Multimodal fusion for multimedia analysis: a survey. Multimedia systems, 16(6): 345--379

work page 2010
[7]

A.; Khan, K

Azam, M. A.; Khan, K. B.; Salahuddin, S.; Rehman, E.; Khan, S. A.; Khan, M. A.; Kadry, S.; and Gandomi, A. H. 2022. A review on multimodal medical image fusion: Compendious analysis of medical modalities, multimodal databases, fusion techniques and quality metrics. Computers in biology and medicine, 144: 105253

work page 2022
[8]

Baltru s aitis, T.; Ahuja, C.; and Morency, L.-P. 2018. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2): 423--443

work page 2018
[9]

A.; Buchman, A

Bennett, D. A.; Buchman, A. S.; Boyle, P. A.; Barnes, L. L.; Wilson, R. S.; and Schneider, J. A. 2018. Religious orders study and rush memory and aging project. Journal of Alzheimer’s disease, 64(s1): S161--S189

work page 2018
[10]

N.; Lee, S.; and Narayanan, S

Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J. N.; Lee, S.; and Narayanan, S. S. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4): 335--359

work page 2008
[11]

H.; Vora, S.; Liong, V

Caesar, H.; Bankiti, V.; Lang, A. H.; Vora, S.; Liong, V. E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; and Beijbom, O. 2020. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11621--11631

work page 2020
[12]

Chen, D.; Su, W.; Wu, P.; and Hua, B. 2023. Joint multimodal sentiment analysis based on information relevance. Information Processing & Management, 60(2): 103193

work page 2023
[13]

Cohen, G.; Afshar, S.; Tapson, J.; and Van Schaik, A. 2017. EMNIST: Extending MNIST to handwritten letters. In 2017 international joint conference on neural networks (IJCNN), 2921--2926. IEEE

work page 2017
[14]

Debes, C.; Merentitis, A.; Heremans, R.; Hahn, J.; Frangiadakis, N.; Van Kasteren, T.; Liao, W.; Bellens, R.; Pi z urica, A.; Gautama, S.; et al. 2014. Hyperspectral and LiDAR data fusion: Outcome of the 2013 GRSS data fusion contest. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(6): 2405--2418

work page 2014
[15]

Fersini, E.; Gasparini, F.; Rizzi, G.; Saibene, A.; Chulvi, B.; Rosso, P.; Lees, A.; and Sorensen, J. 2022. SemEval-2022 Task 5: Multimedia automatic misogyny identification. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), 533--549

work page 2022
[16]

S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.; and Mikolov, T

Frome, A.; Corrado, G. S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.; and Mikolov, T. 2013. Devise: A deep visual-semantic embedding model. Advances in neural information processing systems, 26

work page 2013
[17]

Gader, P.; Zare, A.; Close, R.; Aitken, J.; and Tuell, G. 2013. MUUFL Gulfport hyperspectral and LiDAR airborne data set. Univ. Florida, Gainesville, FL, USA, Tech. Rep. REP-2013-570

work page 2013
[18]

Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, 6904--6913

work page 2017
[19]

Han, Z.; Zhang, C.; Fu, H.; and Zhou, J. T. 2022. Trusted multi-view classification with dynamic evidential fusion. IEEE transactions on pattern analysis and machine intelligence, 45(2): 2551--2566

work page 2022
[20]

Hossain, E.; Sharif, O.; and Hoque, M. M. 2022. MUTE: A multimodal dataset for detecting hateful memes. In Proceedings of the 2nd conference of the asia-pacific chapter of the association for computational linguistics and the 12th international joint conference on natural language processing: student research workshop, 32--39

work page 2022
[21]

Hu, J.; Liu, R.; Hong, D.; Camero, A.; Yao, J.; Schneider, M.; Kurz, F.; Segl, K.; and Zhu, X. X. 2023. MDAS: A new multimodal benchmark dataset for remote sensing. Earth System Science Data, 15(1): 113--131

work page 2023
[22]

J.; and Lew, M

Huiskes, M. J.; and Lew, M. S. 2008. The mir flickr retrieval evaluation. In Proceedings of the 1st ACM international conference on Multimedia information retrieval, 39--43

work page 2008
[23]

Irvin, J.; Sheng, H.; Ramachandran, N.; Johnson-Yu, S.; Zhou, S.; Story, K.; Rustowicz, R.; Elsworth, C.; Austin, K.; and Ng, A. Y. 2020. Forestnet: Classifying drivers of deforestation in indonesia using deep learning on satellite imagery. arXiv preprint arXiv:2011.05479

work page arXiv 2020
[24]

E.; Pollard, T

Johnson, A. E.; Pollard, T. J.; Berkowitz, S. J.; Greenbaum, N. R.; Lungren, M. P.; Deng, C.-y.; Mark, R. G.; and Horng, S. 2019. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1): 317

work page 2019
[25]

E.; Pollard, T

Johnson, A. E.; Pollard, T. J.; Shen, L.; Lehman, L.-w. H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; and Mark, R. G. 2016. MIMIC-III, a freely accessible critical care database. Scientific data, 3(1): 1--9

work page 2016
[26]

Kawahara, J.; Daneshvar, S.; Argenziano, G.; and Hamarneh, G. 2018. Seven-point checklist and skin lesion classification using multitask multimodal neural nets. IEEE journal of biomedical and health informatics, 23(2): 538--546

work page 2018
[27]

Kiros, R.; Salakhutdinov, R.; and Zemel, R. S. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539

work page internal anchor Pith review Pith/arXiv arXiv 2014
[28]

Kong, J.; Cooper, L. A. D.; Wang, F.; Gutman, D. A.; Gao, J.; Chisolm, C.; Sharma, A.; Pan, T.; Van Meir, E. G.; Kurc, T. M.; Moreno, C. S.; Saltz, J. H.; and Brat, D. J. 2011. Integrative, Multi-modal Analysis of Glioblastoma Using TCGA Molecular Data, Pathology Images and Clinical Outcomes. IEEE Transactions on Biomedical Engineering, 58(12): 3469--3474

work page 2011
[29]

A.; and Kanazawa, A

Li, R.; Yang, S.; Ross, D. A.; and Kanazawa, A. 2021. Ai choreographer: Music conditioned 3d dance generation with aist++. In Proceedings of the IEEE/CVF international conference on computer vision, 13401--13412

work page 2021
[30]

Li, Y.; Li, Y.; Wang, X.; Jiang, Y.; Zhang, Z.; Zheng, X.; Wang, H.; Zheng, H.-T.; Huang, F.; Zhou, J.; et al. 2024. Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent. In The Thirteenth International Conference on Learning Representations

work page 2024
[31]

P.; Lyu, Y.; Fan, X.; Wu, Z.; Cheng, Y.; Wu, J.; Chen, L.; Wu, P.; Lee, M

Liang, P. P.; Lyu, Y.; Fan, X.; Wu, Z.; Cheng, Y.; Wu, J.; Chen, L.; Wu, P.; Lee, M. A.; Zhu, Y.; et al. 2021. Multibench: Multiscale benchmarks for multimodal representation learning. Advances in neural information processing systems, 2021(DB1): 1

work page 2021
[32]

P.; Zadeh, A.; and Morency, L.-P

Liang, P. P.; Zadeh, A.; and Morency, L.-P. 2024. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10): 1--42

work page 2024
[33]

Lin, J.; Yang, A.; Zhang, Y.; Liu, J.; Zhou, J.; and Yang, H. 2020. Interbert: Vision-and-language interaction for multi-modal pretraining. arXiv preprint arXiv:2003.13198

work page arXiv 2020
[34]

Lin, N.; Wang, S.; Li, Y.; Wang, B.; Shi, S.; He, Y.; Zhang, W.; Yu, Y.; Zhang, Y.; Zhang, X.; et al. 2025. Resistive memory-based zero-shot liquid state machine for multimodal event data learning. Nature Computational Science, 5(1): 37--47

work page 2025
[35]

Liu, Y.; Yuan, Z.; Mao, H.; Liang, Z.; Yang, W.; Qiu, Y.; Cheng, T.; Li, X.; Xu, H.; and Gao, K. 2022. Make acoustic and visual cues matter: Ch-sims v2. 0 dataset and av-mixup consistent module. In Proceedings of the 2022 international conference on multimodal interaction, 247--258

work page 2022
[36]

Lu, D.; Neves, L.; Carvalho, V.; Zhang, N.; and Ji, H. 2018. Visual attention model for name tagging in multimodal social media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1990--1999

work page 2018
[37]

Lu, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32

work page 2019
[38]

Nagrani, A.; Yang, S.; Arnab, A.; Jansen, A.; Schmid, C.; and Sun, C. 2021. Attention bottlenecks for multimodal fusion. Advances in neural information processing systems, 34: 14200--14213

work page 2021
[39]

Niu, T.; Zhu, S.; Pang, L.; and El Saddik, A. 2016. Sentiment analysis on multi-view social data. In International conference on multimedia modeling, 15--27. Springer

work page 2016
[40]

Okujeni, A.; van der Linden, S.; and Hostert, P. 2016. Berlin-urban-gradient dataset 2009-an enmap preparatory flight campaign

work page 2016
[41]

K.; and Thakor, N

Orchard, G.; Jayawant, A.; Cohen, G. K.; and Thakor, N. 2015. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in neuroscience, 9: 437

work page 2015
[42]

J.; Johnson, A

Pollard, T. J.; Johnson, A. E.; Raffa, J. D.; Celi, L. A.; Mark, R. G.; and Badawi, O. 2018. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific data, 5(1): 1--13

work page 2018
[43]

Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; and Mihalcea, R. 2019. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 527--536

work page 2019
[44]

Ramachandram, D.; and Taylor, G. W. 2017. Deep multimodal learning: A survey on recent advances and trends. IEEE signal processing magazine, 34(6): 96--108

work page 2017
[45]

Rotemberg, V.; Kurtansky, N.; Betz-Stablein, B.; Caffery, L.; Chousakos, E.; Codella, N.; Combalia, M.; Dusza, S.; Guitera, P.; Gutman, D.; et al. 2021. A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Scientific data, 8(1): 34

work page 2021
[46]

Sharma, C.; Bhageria, D.; Scott, W.; Pykl, S.; Das, A.; Chakraborty, T.; Pulabaigari, V.; and Gamb \"a ck, B. 2020. SemEval-2020 Task 8: Memotion Analysis-the Visuo-Lingual Metaphor! In Proceedings of the Fourteenth Workshop on Semantic Evaluation, 759--773

work page 2020
[47]

Shi, Y.; Paige, B.; Torr, P.; et al. 2019. Variational mixture-of-experts autoencoders for multi-modal deep generative models. Advances in neural information processing systems, 32

work page 2019
[48]

Silberman, N.; Hoiem, D.; Kohli, P.; and Fergus, R. 2012. Indoor segmentation and support inference from rgbd images. In European conference on computer vision, 746--760. Springer

work page 2012
[49]

Soleymani, M.; Garcia, D.; Jou, B.; Schuller, B.; Chang, S.-F.; and Pantic, M. 2017. A survey of multimodal sentiment analysis. Image and Vision Computing, 65: 3--14

work page 2017
[50]

P.; and Xiao, J

Song, S.; Lichtenberg, S. P.; and Xiao, J. 2015. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, 567--576

work page 2015
[51]

R.; Arcan, M.; and Buitelaar, P

Suryawanshi, S.; Chakravarthi, B. R.; Arcan, M.; and Buitelaar, P. 2020. Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text. In Proceedings of the second workshop on trolling, aggression and cyberbullying, 32--41

work page 2020
[52]

Tao, M.; Huang, Q.; Xu, K.; Chen, L.; Feng, Y.; and Zhao, D. 2024. Probing Multimodal Large Language Models for Global and Local Semantic Representations. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 13050--13056

work page 2024
[53]

H.; Bai, S.; Liang, P

Tsai, Y.-H. H.; Bai, S.; Liang, P. P.; Kolter, J. Z.; Morency, L.-P.; and Salakhutdinov, R. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the conference. Association for computational linguistics. Meeting, volume 2019, 6558

work page 2019
[54]

University of Trento . 2022. Theses of the University of Trento. [Data set]. Original work published 2020

work page 2022
[55]

Wang, X.; Kumar, D.; Thome, N.; Cord, M.; and Precioso, F. 2015. Recipe recognition with large multimodal food dataset. In 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 1--6. IEEE

work page 2015
[56]

Wang, Y.; Chen, X.; Cao, L.; Huang, W.; Sun, F.; and Wang, Y. 2022. Multimodal Token Fusion for Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR )

work page 2022
[57]

Wei, X.; Zhang, T.; Li, Y.; Zhang, Y.; and Wu, F. 2020. Multi-Modality Cross Attention Network for Image and Sentence Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR )

work page 2020
[58]

N.; Collisson, E

Weinstein, J. N.; Collisson, E. A.; Mills, G. B.; Shaw, K. R.; Ozenberger, B. A.; Ellrott, K.; Shmulevich, I.; Sander, C.; and Stuart, J. M. 2013. The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10): 1113--1120

work page 2013
[59]

R.; Avansino, D

Willett, F. R.; Avansino, D. T.; Hochberg, L. R.; Henderson, J. M.; and Shenoy, K. V. 2021. High-performance brain-to-text communication via handwriting. Nature, 593(7858): 249--254

work page 2021
[60]

Wu, J.; Fang, H.; Li, F.; Fu, H.; Lin, F.; Li, J.; Huang, Y.; Yu, Q.; Song, S.; Xu, X.; et al. 2023. Gamma challenge: glaucoma grading from multi-modality images. Medical Image Analysis, 90: 102938

work page 2023
[61]

Xu, B.; Li, T.; Zheng, J.; Naseriparsa, M.; Zhao, Z.; Lin, H.; and Xia, F. 2022. Met-meme: A multimodal meme dataset rich in metaphors. In Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval, 2887--2899

work page 2022
[62]

Xu, P.; Zhu, X.; and Clifton, D. A. 2023. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10): 12113--12132

work page 2023
[63]

Xu, Y.; Du, B.; Zhang, F.; and Zhang, L. 2018. Hyperspectral image classification via a random patches network. ISPRS journal of photogrammetry and remote sensing, 142: 344--357

work page 2018
[64]

Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; and Yang, K. 2020. Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th annual meeting of the association for computational linguistics, 3718--3727

work page 2020
[65]

Zadeh, A.; Zellers, R.; Pincus, E.; and Morency, L.-P. 2016. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259

work page internal anchor Pith review Pith/arXiv arXiv 2016
[66]

B.; Liang, P

Zadeh, A. B.; Liang, P. P.; Poria, S.; Cambria, E.; and Morency, L.-P. 2018. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2236--2246

work page 2018
[67]

Zhan, X.; Wu, Y.; Dong, X.; Wei, Y.; Lu, M.; Zhang, Y.; Xu, H.; and Liang, X. 2021. Product1m: Towards weakly supervised instance-level product retrieval via cross-modal pretraining. In Proceedings of the IEEE/CVF international conference on computer vision, 11782--11791

work page 2021
[68]

Zhang, Q.; Fu, J.; Liu, X.; and Huang, X. 2018. Adaptive co-attention network for named entity recognition in tweets. In Proceedings of the AAAI conference on artificial intelligence, volume 32

work page 2018
[69]

Zhang, Q.; Wei, Y.; Han, Z.; Fu, H.; Peng, X.; Deng, C.; Hu, Q.; Xu, C.; Wen, J.; Hu, D.; et al. 2024. Multimodal fusion on low-quality data: A comprehensive survey. arXiv preprint arXiv:2404.18947

work page arXiv 2024
[70]

T.; and Peng, X

Zhang, Q.; Wu, H.; Zhang, C.; Hu, Q.; Fu, H.; Zhou, J. T.; and Peng, X. 2023. Provable dynamic fusion for low-quality multimodal data. In International conference on machine learning, 41753--41769. PMLR

[71]

Zhu, J.; Zhou, Y.; Qian, S.; He, Z.; Zhao, T.; Shah, N.; and Koutra, D. 2025. Mosaic of modalities: A comprehensive benchmark for multimodal graph learning. In Proceedings of the Computer Vision and Pattern Recognition Conference, 14215--14224

[3]

Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; and Koyama, M. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2623--2631

[4]

Alain, G.; and Bengio, Y. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644

[5]

Anumula, J.; Neil, D.; Delbruck, T.; and Liu, S.-C. 2018. Feature representations for neuromorphic audio spike streams. Frontiers in neuroscience, 12: 23

[6]

Atrey, P. K.; Hossain, M. A.; El Saddik, A.; and Kankanhalli, M. S. 2010. Multimodal fusion for multimedia analysis: a survey. Multimedia systems, 16(6): 345--379

[7]

Azam, M. A.; Khan, K. B.; Salahuddin, S.; Rehman, E.; Khan, S. A.; Khan, M. A.; Kadry, S.; and Gandomi, A. H. 2022. A review on multimodal medical image fusion: Compendious analysis of medical modalities, multimodal databases, fusion techniques and quality metrics. Computers in biology and medicine, 144: 105253

[8]

Baltrušaitis, T.; Ahuja, C.; and Morency, L.-P. 2018. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2): 423--443

[9]

Bennett, D. A.; Buchman, A. S.; Boyle, P. A.; Barnes, L. L.; Wilson, R. S.; and Schneider, J. A. 2018. Religious orders study and rush memory and aging project. Journal of Alzheimer’s disease, 64(s1): S161--S189

[10]

Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J. N.; Lee, S.; and Narayanan, S. S. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4): 335--359

[11]

Caesar, H.; Bankiti, V.; Lang, A. H.; Vora, S.; Liong, V. E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; and Beijbom, O. 2020. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11621--11631

[12]

Chen, D.; Su, W.; Wu, P.; and Hua, B. 2023. Joint multimodal sentiment analysis based on information relevance. Information Processing & Management, 60(2): 103193

[13]

Cohen, G.; Afshar, S.; Tapson, J.; and Van Schaik, A. 2017. EMNIST: Extending MNIST to handwritten letters. In 2017 international joint conference on neural networks (IJCNN), 2921--2926. IEEE

[14]

Debes, C.; Merentitis, A.; Heremans, R.; Hahn, J.; Frangiadakis, N.; Van Kasteren, T.; Liao, W.; Bellens, R.; Pižurica, A.; Gautama, S.; et al. 2014. Hyperspectral and LiDAR data fusion: Outcome of the 2013 GRSS data fusion contest. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(6): 2405--2418

[15]

Fersini, E.; Gasparini, F.; Rizzi, G.; Saibene, A.; Chulvi, B.; Rosso, P.; Lees, A.; and Sorensen, J. 2022. SemEval-2022 Task 5: Multimedia automatic misogyny identification. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), 533--549

[16]

Frome, A.; Corrado, G. S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.; and Mikolov, T. 2013. Devise: A deep visual-semantic embedding model. Advances in neural information processing systems, 26

[17]

Gader, P.; Zare, A.; Close, R.; Aitken, J.; and Tuell, G. 2013. MUUFL Gulfport hyperspectral and LiDAR airborne data set. Univ. Florida, Gainesville, FL, USA, Tech. Rep. REP-2013-570

[18]

Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, 6904--6913

[19]

Han, Z.; Zhang, C.; Fu, H.; and Zhou, J. T. 2022. Trusted multi-view classification with dynamic evidential fusion. IEEE transactions on pattern analysis and machine intelligence, 45(2): 2551--2566

[20]

Hossain, E.; Sharif, O.; and Hoque, M. M. 2022. MUTE: A multimodal dataset for detecting hateful memes. In Proceedings of the 2nd conference of the asia-pacific chapter of the association for computational linguistics and the 12th international joint conference on natural language processing: student research workshop, 32--39

[21]

Hu, J.; Liu, R.; Hong, D.; Camero, A.; Yao, J.; Schneider, M.; Kurz, F.; Segl, K.; and Zhu, X. X. 2023. MDAS: A new multimodal benchmark dataset for remote sensing. Earth System Science Data, 15(1): 113--131

[22]

Huiskes, M. J.; and Lew, M. S. 2008. The mir flickr retrieval evaluation. In Proceedings of the 1st ACM international conference on Multimedia information retrieval, 39--43

[23]

Irvin, J.; Sheng, H.; Ramachandran, N.; Johnson-Yu, S.; Zhou, S.; Story, K.; Rustowicz, R.; Elsworth, C.; Austin, K.; and Ng, A. Y. 2020. Forestnet: Classifying drivers of deforestation in indonesia using deep learning on satellite imagery. arXiv preprint arXiv:2011.05479

[24]

Johnson, A. E.; Pollard, T. J.; Berkowitz, S. J.; Greenbaum, N. R.; Lungren, M. P.; Deng, C.-y.; Mark, R. G.; and Horng, S. 2019. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1): 317

[25]

Johnson, A. E.; Pollard, T. J.; Shen, L.; Lehman, L.-w. H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; and Mark, R. G. 2016. MIMIC-III, a freely accessible critical care database. Scientific data, 3(1): 1--9

[26]

Kawahara, J.; Daneshvar, S.; Argenziano, G.; and Hamarneh, G. 2018. Seven-point checklist and skin lesion classification using multitask multimodal neural nets. IEEE journal of biomedical and health informatics, 23(2): 538--546

[27]

Kiros, R.; Salakhutdinov, R.; and Zemel, R. S. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539

[28]

Kong, J.; Cooper, L. A. D.; Wang, F.; Gutman, D. A.; Gao, J.; Chisolm, C.; Sharma, A.; Pan, T.; Van Meir, E. G.; Kurc, T. M.; Moreno, C. S.; Saltz, J. H.; and Brat, D. J. 2011. Integrative, Multi-modal Analysis of Glioblastoma Using TCGA Molecular Data, Pathology Images and Clinical Outcomes. IEEE Transactions on Biomedical Engineering, 58(12): 3469--3474

[29]

Li, R.; Yang, S.; Ross, D. A.; and Kanazawa, A. 2021. Ai choreographer: Music conditioned 3d dance generation with aist++. In Proceedings of the IEEE/CVF international conference on computer vision, 13401--13412

[30]

Li, Y.; Li, Y.; Wang, X.; Jiang, Y.; Zhang, Z.; Zheng, X.; Wang, H.; Zheng, H.-T.; Huang, F.; Zhou, J.; et al. 2024. Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent. In The Thirteenth International Conference on Learning Representations

[31]

Liang, P. P.; Lyu, Y.; Fan, X.; Wu, Z.; Cheng, Y.; Wu, J.; Chen, L.; Wu, P.; Lee, M. A.; Zhu, Y.; et al. 2021. Multibench: Multiscale benchmarks for multimodal representation learning. Advances in neural information processing systems, 2021(DB1): 1

[32]

Liang, P. P.; Zadeh, A.; and Morency, L.-P. 2024. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys, 56(10): 1--42

[33]

Lin, J.; Yang, A.; Zhang, Y.; Liu, J.; Zhou, J.; and Yang, H. 2020. Interbert: Vision-and-language interaction for multi-modal pretraining. arXiv preprint arXiv:2003.13198

[34]

Lin, N.; Wang, S.; Li, Y.; Wang, B.; Shi, S.; He, Y.; Zhang, W.; Yu, Y.; Zhang, Y.; Zhang, X.; et al. 2025. Resistive memory-based zero-shot liquid state machine for multimodal event data learning. Nature Computational Science, 5(1): 37--47

[35]

Liu, Y.; Yuan, Z.; Mao, H.; Liang, Z.; Yang, W.; Qiu, Y.; Cheng, T.; Li, X.; Xu, H.; and Gao, K. 2022. Make acoustic and visual cues matter: Ch-sims v2.0 dataset and av-mixup consistent module. In Proceedings of the 2022 international conference on multimodal interaction, 247--258

[36]

Lu, D.; Neves, L.; Carvalho, V.; Zhang, N.; and Ji, H. 2018. Visual attention model for name tagging in multimodal social media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1990--1999

[37]

Lu, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32

[38]

Nagrani, A.; Yang, S.; Arnab, A.; Jansen, A.; Schmid, C.; and Sun, C. 2021. Attention bottlenecks for multimodal fusion. Advances in neural information processing systems, 34: 14200--14213

[39]

Niu, T.; Zhu, S.; Pang, L.; and El Saddik, A. 2016. Sentiment analysis on multi-view social data. In International conference on multimedia modeling, 15--27. Springer

[40]

Okujeni, A.; van der Linden, S.; and Hostert, P. 2016. Berlin-urban-gradient dataset 2009-an enmap preparatory flight campaign

[41]

Orchard, G.; Jayawant, A.; Cohen, G. K.; and Thakor, N. 2015. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in neuroscience, 9: 437

[42]

Pollard, T. J.; Johnson, A. E.; Raffa, J. D.; Celi, L. A.; Mark, R. G.; and Badawi, O. 2018. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific data, 5(1): 1--13

[43]

Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; and Mihalcea, R. 2019. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 527--536

[44]

Ramachandram, D.; and Taylor, G. W. 2017. Deep multimodal learning: A survey on recent advances and trends. IEEE signal processing magazine, 34(6): 96--108

[45]

Rotemberg, V.; Kurtansky, N.; Betz-Stablein, B.; Caffery, L.; Chousakos, E.; Codella, N.; Combalia, M.; Dusza, S.; Guitera, P.; Gutman, D.; et al. 2021. A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Scientific data, 8(1): 34

[46]

Sharma, C.; Bhageria, D.; Scott, W.; Pykl, S.; Das, A.; Chakraborty, T.; Pulabaigari, V.; and Gambäck, B. 2020. SemEval-2020 Task 8: Memotion Analysis-the Visuo-Lingual Metaphor! In Proceedings of the Fourteenth Workshop on Semantic Evaluation, 759--773

[47]

Shi, Y.; Paige, B.; Torr, P.; et al. 2019. Variational mixture-of-experts autoencoders for multi-modal deep generative models. Advances in neural information processing systems, 32

[48]

Silberman, N.; Hoiem, D.; Kohli, P.; and Fergus, R. 2012. Indoor segmentation and support inference from rgbd images. In European conference on computer vision, 746--760. Springer

[49]

Soleymani, M.; Garcia, D.; Jou, B.; Schuller, B.; Chang, S.-F.; and Pantic, M. 2017. A survey of multimodal sentiment analysis. Image and Vision Computing, 65: 3--14

[50]

Song, S.; Lichtenberg, S. P.; and Xiao, J. 2015. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, 567--576

[51]

Suryawanshi, S.; Chakravarthi, B. R.; Arcan, M.; and Buitelaar, P. 2020. Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text. In Proceedings of the second workshop on trolling, aggression and cyberbullying, 32--41

[52]

Tao, M.; Huang, Q.; Xu, K.; Chen, L.; Feng, Y.; and Zhao, D. 2024. Probing Multimodal Large Language Models for Global and Local Semantic Representations. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 13050--13056

[53]

Tsai, Y.-H. H.; Bai, S.; Liang, P. P.; Kolter, J. Z.; Morency, L.-P.; and Salakhutdinov, R. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the conference. Association for computational linguistics. Meeting, volume 2019, 6558

[54]

University of Trento. 2022. Theses of the University of Trento. [Data set]. Original work published 2020

[55]

Wang, X.; Kumar, D.; Thome, N.; Cord, M.; and Precioso, F. 2015. Recipe recognition with large multimodal food dataset. In 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 1--6. IEEE

[56]

Wang, Y.; Chen, X.; Cao, L.; Huang, W.; Sun, F.; and Wang, Y. 2022. Multimodal Token Fusion for Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

[57]

Wei, X.; Zhang, T.; Li, Y.; Zhang, Y.; and Wu, F. 2020. Multi-Modality Cross Attention Network for Image and Sentence Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

[58]

Weinstein, J. N.; Collisson, E. A.; Mills, G. B.; Shaw, K. R.; Ozenberger, B. A.; Ellrott, K.; Shmulevich, I.; Sander, C.; and Stuart, J. M. 2013. The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10): 1113--1120

[59]

Willett, F. R.; Avansino, D. T.; Hochberg, L. R.; Henderson, J. M.; and Shenoy, K. V. 2021. High-performance brain-to-text communication via handwriting. Nature, 593(7858): 249--254

[60]

Wu, J.; Fang, H.; Li, F.; Fu, H.; Lin, F.; Li, J.; Huang, Y.; Yu, Q.; Song, S.; Xu, X.; et al. 2023. Gamma challenge: glaucoma grading from multi-modality images. Medical Image Analysis, 90: 102938

[61]

Xu, B.; Li, T.; Zheng, J.; Naseriparsa, M.; Zhao, Z.; Lin, H.; and Xia, F. 2022. Met-meme: A multimodal meme dataset rich in metaphors. In Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval, 2887--2899

[62]

Xu, P.; Zhu, X.; and Clifton, D. A. 2023. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10): 12113--12132

