Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
Pith reviewed 2026-05-08 12:16 UTC · model grok-4.3
The pith
A comprehensive benchmark reveals that recent specialized multimodal domain generalization methods offer only marginal improvements over standard training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce MMDG-Bench to standardize evaluation in multimodal domain generalization. Through extensive experiments involving 7,402 trained neural networks across 95 cross-domain tasks, they find that recent specialized methods offer only marginal gains over the ERM baseline, that no method wins consistently, that trimodal setups do not reliably beat bimodal ones, and that all methods struggle with corruptions and missing modalities while some harm trustworthiness.
What carries the argument
MMDG-Bench, a unified evaluation framework that standardizes datasets, modality combinations, methods, and tests for accuracy, corruption robustness, missing modalities, and detection of misclassifications or out-of-distribution samples.
Load-bearing premise
The nine selected methods and six datasets sufficiently represent the diversity of approaches and challenges in the multimodal domain generalization field.
What would settle it
A new method that achieves substantially higher accuracy than ERM across all six datasets, all modality combinations, and under corruption and missing-modality conditions would challenge the finding of only marginal progress.
Figures
Original abstract
Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization obscures a reliable assessment of the field's advancement. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7,402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over the ERM baseline; (2) no single method consistently outperforms others across datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods exhibit significant degradation under corruption and missing-modality scenarios, with some methods further compromising model trustworthiness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MMDG-Bench, the first unified benchmark for Multimodal Domain Generalization (MMDG). It standardizes evaluation across six datasets spanning action recognition, mechanical fault diagnosis, and sentiment analysis; six modality combinations; nine methods; and multiple settings including corruption robustness, missing-modality generalization, misclassification detection, and OOD detection. With 7,402 networks trained over 95 cross-domain tasks, the paper reports five findings: specialized MMDG methods yield only marginal gains over ERM, no method consistently outperforms others, a large gap to upper-bound performance remains, trimodal fusion does not reliably beat the best bimodal setups, and all methods degrade under corruptions/missing modalities with some harming trustworthiness.
Significance. If the empirical conclusions are robust, this benchmark study is significant for documenting limited algorithmic progress in MMDG beyond ERM and for supplying a standardized, multi-task, multi-metric evaluation framework that future work can build upon. The scale (7,402 models) and breadth (robustness + trustworthiness metrics) are genuine strengths that could help the community avoid fragmented, non-comparable results.
major comments (3)
- [Abstract and §3] Abstract and §3 (Benchmark Construction): The claim that the nine methods are 'representative' and the six datasets provide 'unified' coverage is load-bearing for findings (1) and (2) on marginal gains and lack of consistent winner. No explicit inclusion criteria, exhaustive literature survey, or ablation demonstrating that omitted methods/datasets would not change the ranking is supplied; this directly limits the generalizability of the 'no real progress' conclusion.
- [§4.3 and Table 2] §4.3 (Implementation Details) and Table 2: The manuscript reports aggregate accuracies but provides no description of hyperparameter search ranges, exact train/val/test splits per domain, number of random seeds, or statistical significance testing. Without these, it is impossible to verify whether the reported marginal improvements over ERM are stable or could be artifacts of implementation choices.
- [§5.1] §5.1 (Main Results): The upper-bound performance is referenced but its construction (e.g., whether it uses oracle domain labels or privileged information) is not detailed enough to interpret the size of the 'substantial gap' claimed in finding (3). This gap is central to the paper's narrative that MMDG remains far from solved.
minor comments (2)
- [Figure 3 and §4.2] Figure 3 and §4.2: The visualization of modality combinations could include error bars or per-run variance to make the 'no consistent winner' claim visually clearer.
- [Related Work] Related Work section: A short table comparing MMDG-Bench to prior single-task benchmarks (e.g., on action recognition only) would help readers quickly see the added coverage.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important areas for improving clarity and reproducibility. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core empirical findings.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): The claim that the nine methods are 'representative' and the six datasets provide 'unified' coverage is load-bearing for findings (1) and (2) on marginal gains and lack of consistent winner. No explicit inclusion criteria, exhaustive literature survey, or ablation demonstrating that omitted methods/datasets would not change the ranking is supplied; this directly limits the generalizability of the 'no real progress' conclusion.
Authors: We selected the nine methods to cover the primary algorithmic paradigms in recent MMDG literature (invariant feature learning, augmentation-based, meta-learning, and fusion strategies) from top venues with publicly available code. The six datasets were chosen to extend beyond action recognition to fault diagnosis and sentiment analysis while supporting the six modality combinations. We will add an explicit subsection in §3 listing inclusion criteria (publication year 2020+, multimodal applicability, reproducibility) and a short discussion of omissions (e.g., methods lacking code or not supporting trimodal inputs). While an exhaustive survey or full ablation on every omitted method is outside the scope of a benchmark paper, we will note that the marginal-gains finding is consistent across the evaluated representative set. This will be a partial revision.
-
Referee: [§4.3 and Table 2] §4.3 (Implementation Details) and Table 2: The manuscript reports aggregate accuracies but provides no description of hyperparameter search ranges, exact train/val/test splits per domain, number of random seeds, or statistical significance testing. Without these, it is impossible to verify whether the reported marginal improvements over ERM are stable or could be artifacts of implementation choices.
Authors: We agree these details are essential. In the revision we will expand §4.3 to specify: hyperparameter grids (learning rate 1e-4–1e-2, batch size 32–128, etc.), exact per-domain train/val/test splits for each of the six datasets, training with three random seeds, and paired t-test results confirming statistical significance of differences from ERM. Table 2 will be updated to report mean ± standard deviation. This is a full revision.
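The paired t-test the authors commit to can be sketched in a few lines. The `paired_t` helper and the per-task accuracy values below are hypothetical illustrations of the procedure, not numbers from the paper:

```python
import math

def paired_t(method_acc, erm_acc):
    """Paired t-statistic for per-task accuracy differences vs. ERM.

    method_acc, erm_acc: accuracies on the same cross-domain tasks,
    in the same order. Returns (mean difference, t statistic, dof).
    """
    assert len(method_acc) == len(erm_acc)
    diffs = [m - e for m, e in zip(method_acc, erm_acc)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the paired differences (Bessel's correction)
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    return mean, t, n - 1

# Hypothetical accuracies for one method vs. ERM on five tasks
method = [71.2, 68.5, 74.0, 66.1, 70.3]
erm    = [70.8, 68.9, 73.1, 65.7, 70.0]
mean_gain, t_stat, dof = paired_t(method, erm)
```

A small mean gain with a modest t statistic (compared against the t distribution with `dof` degrees of freedom) is exactly the "marginal improvement" pattern the benchmark reports; in practice one would use `scipy.stats.ttest_rel` for the p-value.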
-
Referee: [§5.1] §5.1 (Main Results): The upper-bound performance is referenced but its construction (e.g., whether it uses oracle domain labels or privileged information) is not detailed enough to interpret the size of the 'substantial gap' claimed in finding (3). This gap is central to the paper's narrative that MMDG remains far from solved.
Authors: The upper bound is obtained by training the same architectures on the pooled labeled data from all source and target domains (i.e., no domain shift during training), providing an oracle ceiling without domain-generalization constraints. It uses only standard supervised labels and does not rely on additional privileged information. We will add a precise description and footnote in §5.1 clarifying this construction. This is a full revision.
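The split the rebuttal describes (leave-one-domain-out training for the DG setting versus pooled training for the oracle ceiling) can be illustrated with a minimal sketch; the `make_splits` helper and toy domain names are illustrative, not from the paper:

```python
def make_splits(domains, target):
    """Build training sets for the DG setting vs. the oracle upper bound.

    domains: dict mapping domain name -> list of labeled samples.
    The DG model never sees the target domain during training; the
    oracle trains on pooled data from all domains (no domain shift),
    matching the construction described in the rebuttal.
    """
    dg_train = [s for d, samples in domains.items() if d != target
                for s in samples]
    oracle_train = [s for samples in domains.values() for s in samples]
    test = list(domains[target])
    return dg_train, oracle_train, test

# Toy example: three domains, with "kitchen_3" as the held-out target
domains = {"kitchen_1": [1, 2], "kitchen_2": [3], "kitchen_3": [4, 5]}
dg_train, oracle_train, test = make_splits(domains, target="kitchen_3")
# dg_train     -> [1, 2, 3]        (target held out)
# oracle_train -> [1, 2, 3, 4, 5]  (pooled, includes target)
```

The "substantial gap" in finding (3) is then simply the accuracy difference between models trained on `oracle_train` and models trained on `dg_train`, both evaluated on `test`.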
Circularity Check
No circularity: purely empirical benchmark with measured results on held-out data
Full rationale
The paper performs a large-scale empirical comparison of nine MMDG methods across six datasets and multiple evaluation protocols. All reported findings (marginal gains over ERM, lack of consistent winner, performance gaps) are direct measurements on held-out target domains rather than quantities derived from fitted parameters or self-referential definitions inside the paper. No equations, uniqueness theorems, or ansatzes are invoked that reduce to the paper's own inputs by construction. Selection of methods and datasets raises external-validity questions but does not create internal circularity per the enumerated patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The six selected datasets and three tasks adequately sample the space of multimodal domain shifts encountered in practice.
invented entities (1)
- MMDG-Bench (no independent evidence)
Reference graph
Works this paper leans on
- [1] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
- [2] Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. OpenFace: an open source facial behavior analysis toolkit. In WACV, 2016.
- [3] Gilles Blanchard, Gyemin Lee, and Clayton Scott. Generalizing from several related classification tasks to a new unlabeled sample. In NeurIPS, 2011.
- [4] Xin Chen, Huanjie Tao, and Benran Li. Towards robust incomplete multimodal open-set domain generalization with uncertain missing modalities. Knowledge-Based Systems, page 115777, 2026.
- [5] MMAction2 Contributors. OpenMMLab's next generation video understanding toolbox and benchmark. https://github.com/open-mmlab/mmaction2, 2020.
- [6] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The EPIC-KITCHENS dataset. In ECCV, 2018.
- [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186, 2019.
- [8] Hao Dong, Eleni Chatzi, and Olga Fink. Towards multimodal open-set domain generalization and adaptation through self-supervision. In ECCV, 2024.
- [9] Hao Dong, Eleni Chatzi, and Olga Fink. Towards robust multimodal open-set test-time adaptation via adaptive entropy-aware optimization. In ICLR, 2025.
- [10] Hao Dong, Moru Liu, Kaiyang Zhou, Eleni Chatzi, Juho Kannala, Cyrill Stachniss, and Olga Fink. Advances in multimodal adaptation and generalization: From traditional approaches to foundation models. arXiv preprint arXiv:2501.18592, 2025.
- [11] Hao Dong, Ismail Nejjar, Han Sun, Eleni Chatzi, and Olga Fink. SimMMDG: A simple and effective framework for multi-modal domain generalization. In NeurIPS, 2023.
- [12] Yunfeng Fan, Wenchao Xu, Haozhao Wang, and Song Guo. Cross-modal representation flattening for multi-modal domain generalization. In NeurIPS, 2024.
- [13] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In ICCV, 2019.
- [14] Olga Fink, Ismail Nejjar, Vinay Sharma, Keivan Faghih Niresi, Han Sun, Hao Dong, Chenghao Xu, Amaury Wei, Arthur Bizzi, Raffael Theiler, et al. From physics to machine learning and back: Part II - learning and observational bias in prognostics and health management (PHM). Reliability Engineering & System Safety, page 112376, 2026.
- [15] Olga Fink, Vinay Sharma, Ismail Nejjar, Leandro von Krannichfeldt, Sergei Garmaev, Zepeng Zhang, Amaury Wei, Gaetan Frusque, Florent Forest, Mengjie Zhao, et al. From physics to machine learning and back: Part I - learning with inductive biases in prognostics and health management. Reliability Engineering & System Safety, page 112213, 2026.
- [16] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
- [17] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020.
- [18] Cagri Gungor and Adriana Kovashka. Integrating audio narrations to strengthen domain generalization in multimodal first-person action recognition. In ICASSP, 2025.
- [19] Zirun Guo, Tao Jin, Wenlong Xu, Wang Lin, and Yangyang Wu. Bridging the gap for test-time multimodal sentiment analysis. In AAAI, 2025.
- [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [21] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021.
- [22] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
- [23] Hai Huang, Yan Xia, Sashuai Zhou, Hanting Wang, Shulei Wang, and Zhou Zhao. Bridging domain generalization to multimodal domain generalization via unified representations. In ICCV, 2025.
- [24] Hyeonbin Ji, Juyeob Lee, and Eunil Park. Alignment and distillation: A robust framework for multimodal domain generalizable human action recognition. In WACV, 2026.
- [25] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. In ICML, 2021.
- [26] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (REx). In ICML, 2021.
- [27] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Learning to generalize: Meta-learning for domain generalization. In AAAI, 2018.
- [28] Hongzhao Li, Hao Dong, Hualei Wan, Shupan Li, Mingliang Xu, and Muhammad Haris Khan. Towards multimodal domain generalization with few labels. arXiv preprint arXiv:2602.22917, 2026.
- [29] Hongzhao Li, Guohao Shen, Shupan Li, Mingliang Xu, and Muhammad Haris Khan. Balancing multimodal domain generalization via gradient modulation and projection. In AAAI, 2026.
- [30] Hongzhao Li, Hualei Wan, Liangzhi Zhang, Mingyuan Jiu, Shupan Li, Mingliang Xu, and Muhammad Haris Khan. Towards robust multimodal domain generalization via modality-domain joint adversarial training. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025.
- [31] Shawn Li, Huixian Gong, Hao Dong, Tiankai Yang, Zhengzhong Tu, and Yue Zhao. DPU: Dynamic prototype updating for multimodal out-of-distribution detection. arXiv preprint arXiv:2411.08227, 2024.
- [32] Moru Liu, Hao Dong, Olga Fink, and Mario Trapp. Adaptive confidence regularization for multimodal failure detection. arXiv preprint arXiv:2603.02200, 2026.
- [33] Moru Liu, Hao Dong, Jessica Kelly, Olga Fink, and Mario Trapp. Extremely simple multimodal outlier synthesis for out-of-distribution detection and segmentation. arXiv preprint arXiv:2505.16985, 2025.
- [34] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, Oriol Nieto, et al. librosa: Audio and music signal analysis in Python. SciPy, 2015(18-24):7, 2015.
- [35] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In ICML, 2013.
- [36] Jonathan Munro and Dima Damen. Multi-modal domain adaptation for fine-grained action recognition. In CVPR, 2020.
- [37] Mirco Planamente, Chiara Plizzari, Emanuele Alberti, and Barbara Caputo. Domain generalization through audio-visual relative norm alignment in first person action recognition. In WACV, 2022.
- [38] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.
- [39] Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In ECCV, 2016.
- [40] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR, 2011.
- [41] Vladimir N Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.
- [42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
- [43] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. In NeurIPS, 2018.
- [44] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip S Yu. Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, 35(8):8052–8072, 2022.
- [45] Xiaohan Wang, Zhangtao Cheng, Ting Zhong, Leiting Chen, and Fan Zhou. Modality-balanced collaborative distillation for multi-modal domain generalization. In AAAI, 2026.
- [46] Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun Zou, and Kaicheng Yang. CH-SIMS: A Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In ACL, 2020.
- [47] Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259, 2016.
- [48] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In ACL, 2018.
- [49] Baoqiang Zhang, Kunze Huang, Luyao Luyao, Xiaotong Tu, and Xiaolu Li. Nonpolarized embedding learning in multimodal domain generalization. Neurocomputing, page 131754, 2025.
- [50] Xingxuan Zhang, Yue He, Renzhe Xu, Han Yu, Zheyan Shen, and Peng Cui. NICO++: Towards better benchmarking for domain generalization. In CVPR, 2023.
- [51] Chao Zhao, Enrico Zio, and Weiming Shen. Domain generalization for cross-domain fault diagnosis: An application-oriented perspective and a benchmark study. Reliability Engineering & System Safety, 245:109964, 2024.
- [52] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4396–4415, 2022.