MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains
Pith reviewed 2026-05-17 23:16 UTC · model grok-4.3
The pith
A new benchmark unites over 30 datasets across 15 modalities and 20 tasks to create standardized tests for multimodal fusion methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The work integrates over 30 datasets covering 15 modalities and 20 predictive tasks across key application domains, and supplies an open-source, unified, automated evaluation pipeline with standardized implementations of state-of-the-art models and fusion paradigms. Together these enable large-scale experiments that establish new performance baselines and give the community a platform for rigorous, reproducible assessment of multimodal models.
What carries the argument
The MULTIBENCH++ benchmark together with its automated evaluation pipeline, which standardizes datasets, tasks, models, and fusion methods for consistent testing.
If this is right
- Fusion methods can be compared directly on a shared set of tasks instead of isolated datasets.
- Models are less likely to overfit to the quirks of any single dataset and more likely to generalize.
- New baselines provide concrete targets for future improvements in multimodal performance.
- The automated pipeline makes it easier to reproduce results and test additional methods quickly.
Where Pith is reading between the lines
- The benchmark could highlight which fusion strategies succeed in specialized settings such as medical imaging or robotics.
- Extending the platform with new modalities or tasks would require only minimal changes to the pipeline structure.
- Researchers might adapt the evaluation code to study trade-offs between accuracy, speed, and data requirements across domains.
Load-bearing premise
The selected datasets and tasks capture the full complexity and variety of real-world multimodal problems without major biases or gaps that would distort evaluations.
What would settle it
Running the same top methods from this benchmark on a fresh collection of multimodal datasets drawn from domains not represented in the 30+ datasets and observing consistent drops in relative performance would indicate the benchmark's coverage is incomplete.
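One way to make this check concrete is to compare how the methods rank on the benchmark with how they rank on the held-out collection. The sketch below computes a Spearman rank correlation in plain Python. The method names and scores are hypothetical placeholders, not results from the paper, and using Spearman correlation here is an assumption of this sketch rather than something the paper prescribes.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank methods by score, 1 = best (higher is better); ties broken by name."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {method: position + 1 for position, method in enumerate(ordered)}


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman rank correlation over the methods present in both score tables.

    Uses the closed-form formula, which assumes there are no tied scores.
    """
    methods = sorted(set(a) & set(b))
    n = len(methods)
    if n < 2:
        raise ValueError("need at least two shared methods")
    ra = ranks({m: a[m] for m in methods})
    rb = ranks({m: b[m] for m in methods})
    d2 = sum((ra[m] - rb[m]) ** 2 for m in methods)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical, made-up scores for four placeholder fusion methods.
benchmark_scores = {"method_a": 0.90, "method_b": 0.80, "method_c": 0.70, "method_d": 0.60}
held_out_scores = {"method_a": 0.85, "method_b": 0.60, "method_c": 0.70, "method_d": 0.50}

rho = spearman(benchmark_scores, held_out_scores)
print(f"rank agreement (Spearman rho): {rho:.2f}")
```

A value near 1 would suggest the benchmark ordering carries over to the new domains. A value near 0 or below would be evidence that coverage is incomplete in the sense described above.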
Original abstract
Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that inadequately represents the complexity and diversity of real-world scenarios, potentially leading to biased evaluations. This issue presents a twofold challenge. On one hand, models may overfit to the biases of specific datasets, hindering their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparisons between different fusion methods difficult. Consequently, a truly universal and high-performance fusion model has yet to emerge. To address these challenges, we have developed a large-scale, domain-adaptive benchmark for multimodal evaluation. This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement this, we have also developed an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Leveraging this platform, we have conducted large-scale experiments, successfully establishing new performance baselines across multiple tasks. This work provides the academic community with a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to propel the field of multimodal artificial intelligence to new heights.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MULTIBENCH++, a large-scale domain-adaptive benchmark that integrates over 30 datasets spanning 15 modalities and 20 predictive tasks across key application domains, together with an open-source unified automated evaluation pipeline containing standardized implementations of state-of-the-art models and fusion paradigms; the authors report large-scale experiments that establish new performance baselines.
Significance. If the coverage and curation claims are substantiated, the benchmark and pipeline would supply the community with a much-needed unified platform for reproducible, cross-method comparisons of multimodal fusion techniques, reducing the risk of dataset-specific overfitting and supporting more reliable progress in multimodal AI.
major comments (1)
- [Abstract] The central claim that the collection of over 30 datasets 'adequately represents the complexity and diversity of real-world scenarios' is load-bearing for the 'domain-adaptive' and 'unified' properties, yet the manuscript supplies no explicit taxonomy of multimodal scenarios, quantitative coverage metrics, or justification that the chosen datasets avoid systematic under-representation of modality combinations or application regimes.
minor comments (1)
- [Abstract] The abstract would be clearer if it briefly defined the 15 modalities and the 20 predictive tasks rather than leaving them as aggregate counts.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive evaluation of the benchmark's potential impact. We address the single major comment below and will revise the manuscript to strengthen the supporting arguments for our claims.
- Referee: [Abstract] The central claim that the collection of over 30 datasets 'adequately represents the complexity and diversity of real-world scenarios' is load-bearing for the 'domain-adaptive' and 'unified' properties, yet the manuscript supplies no explicit taxonomy of multimodal scenarios, quantitative coverage metrics, or justification that the chosen datasets avoid systematic under-representation of modality combinations or application regimes.
Authors: We agree that the abstract's phrasing makes a strong claim that requires clearer substantiation to fully support the 'domain-adaptive' and 'unified' framing. The manuscript describes the curation process, including coverage of 15 modalities and 20 tasks drawn from domains such as healthcare, robotics, and multimedia, with rationale for dataset selection to promote diversity. However, it does not include an explicit taxonomy of multimodal scenarios or quantitative coverage metrics (e.g., modality-pair frequency or regime representation scores). We will revise the abstract to moderate the claim and add a new subsection (or appendix) in the main text that provides (1) a taxonomy organizing scenarios by modality combinations, task types, and application regimes, (2) quantitative metrics on coverage, and (3) explicit discussion of potential gaps and how the selected datasets mitigate systematic under-representation. These additions will be supported by tables and analysis of the 30+ datasets.
Revision: yes
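The modality-pair frequency mentioned in the response can be illustrated with a short sketch. It counts how many datasets contain each pair of modalities and lists the pairs that no dataset covers. The dataset names and modality lists below are hypothetical placeholders, not the benchmark's actual contents, and this is only one plausible reading of "modality-pair frequency".

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def modality_pair_counts(datasets: Dict[str, List[str]]) -> Counter:
    """Count, for each unordered modality pair, the datasets that contain both."""
    counts: Counter = Counter()
    for modalities in datasets.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(modalities)), 2):
            counts[pair] += 1
    return counts


def uncovered_pairs(datasets: Dict[str, List[str]]) -> List[Pair]:
    """List modality pairs that never co-occur in any single dataset."""
    all_modalities = sorted({m for mods in datasets.values() for m in mods})
    counts = modality_pair_counts(datasets)
    return [p for p in combinations(all_modalities, 2) if counts[p] == 0]


# Hypothetical example inventory (placeholder names, not the real benchmark).
example = {
    "dataset_a": ["image", "text"],
    "dataset_b": ["image", "audio", "text"],
    "dataset_c": ["lidar", "image"],
}

pair_counts = modality_pair_counts(example)
missing = uncovered_pairs(example)
print(pair_counts.most_common())
print("uncovered pairs:", missing)
```

A table of such counts, together with the list of uncovered pairs, is the kind of evidence that would let a reader judge whether particular modality combinations are systematically under-represented.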
Circularity Check
No circularity: benchmark is a constructive artifact
The paper presents the creation of MULTIBENCH++ as the integration of over 30 existing public datasets spanning 15 modalities and 20 tasks, plus an open-source evaluation pipeline with standardized model implementations. This is a curation and standardization effort rather than a derivation chain involving equations, fitted parameters, or predictions that reduce to the inputs by construction. No self-definitional loops, fitted-input-as-prediction, or load-bearing self-citations appear in the abstract or described claims; the coverage assertions rest on explicit dataset selection rather than any mathematical equivalence to prior results. The work is self-contained as an empirical benchmarking platform.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The integrated datasets and tasks represent the complexity and diversity of real-world multimodal scenarios.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We evaluate several Transformer-based architectures for multimodal feature fusion... Hierarchical Attention (Multi-to-One), Cross-Attention Fusion (CAF)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.