MMDG-Bench: A Benchmark for Multimodal Domain Generalization

Da Li; Qianshan Zhan; Qian Wang; Xiao-Jun Zeng; Xiatian Zhu

arxiv: 2606.00891 · v1 · pith:VV66QLCOnew · submitted 2026-05-30 · 💻 cs.CV

Qianshan Zhan, Qian Wang, Da Li, Xiao-Jun Zeng, Xiatian Zhu

Pith reviewed 2026-06-28 18:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal domain generalization · benchmark · domain generalization · multimodal learning · action recognition · face anti-spoofing · model robustness

The pith

Structured pairings of a unified multi-modal setup with five domain generalization techniques in two orderings frequently outperform existing state-of-the-art methods on unseen domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MMDG-Bench to create standardized evaluation for multi-modal domain generalization by defining two integration frameworks and testing ten baselines on video-audio-flow action recognition plus RGB-Depth-IR face anti-spoofing. It shows these baselines often beat prior methods while delivering three insights about consistent gains from domain generalization, ordering choices tied to modal stability, and larger benefits from stronger backbones. A sympathetic reader would care because the absence of unified protocols has kept results hard to compare and slowed development of multi-modal models that hold up outside their training domains.

Core claim

We introduce MMDG-Bench featuring DG-then-MML and MML-then-DG frameworks along with unified protocols across tasks. By instantiating ten MMDG baselines through pairing a unified MML configuration with five DG techniques under both orderings, we demonstrate that these structured combinations frequently outperform existing state-of-the-art methods. Our analysis yields three key insights: integrating DG techniques provides consistent generalization gains across various backbones whereas non-DG methods are highly sensitive to backbone shifts; the optimal framework choice depends on inter-modal stability with D2M excelling when modal relations are stable across domains while M2D is more robust to

What carries the argument

The D2M (DG then MML) and M2D (MML then DG) frameworks that structure the integration of one unified multi-modal learning configuration with five domain generalization techniques.
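As a rough illustration of how the ten baselines arise, the sketch below crosses the two orderings with five DG techniques. The ordering names D2M and M2D and the `MM` label appear on this page; the five technique labels are an assumption inferred from the figure captions (the page does not list them explicitly), and `enumerate_baselines` is a hypothetical helper, not part of the released code.

```python
from itertools import product

# Ordering names come from the paper's two frameworks.
ORDERINGS = ("D2M", "M2D")
# ASSUMPTION: these five labels are inferred from the figure captions;
# the page never enumerates the five DG techniques explicitly.
DG_TECHNIQUES = ("ERM", "Mixup", "MMD", "DC", "MIRO")
# The single unified MML configuration, labelled "MM" in the Figure 3 caption.
MML_CONFIG = "MM"


def enumerate_baselines():
    """Return one (ordering, mml_config, dg_technique) triple per baseline."""
    return [(o, MML_CONFIG, dg) for o, dg in product(ORDERINGS, DG_TECHNIQUES)]


baselines = enumerate_baselines()
for ordering, mml, dg in baselines:
    print(f"{ordering}: {mml}+{dg}")
```

Two orderings times five techniques gives the ten baselines the paper describes; swapping in a different technique list changes only `DG_TECHNIQUES`.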

If this is right

Structured MMDG baselines frequently outperform existing state-of-the-art methods.
Integrating DG techniques provides consistent generalization gains across various backbones, whereas non-DG methods are highly sensitive to backbone shifts.
The optimal framework choice depends on inter-modal stability: D2M excels when modal relations are stable across domains, while M2D is more robust to cross-domain relational variance.
Stronger backbones yield amplified performance dividends when integrated into the structured frameworks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Testing the same frameworks on tasks with greater modal variance, such as medical imaging, could show whether the stability-based ordering rule holds more broadly.
Measuring inter-modal stability directly might allow automatic selection between D2M and M2D without exhaustive search.
The amplified gains from stronger backbones suggest that scaling model capacity inside these frameworks could produce larger robustness improvements than scaling alone.
Releasing the benchmark code enables direct comparison of new DG or MML techniques against the ten baselines rather than isolated SOTA numbers.
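The second extension above, selecting an ordering from a measured stability score, could look roughly like the sketch below. Everything in it is hypothetical: the paper does not define this procedure, the score (mean cosine similarity between paired modality features, compared across domains) is only one possible proxy for inter-modal stability, and the `threshold` value is arbitrary.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def modal_relation(pairs):
    """Mean cosine similarity over paired (modality A, modality B) features."""
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def suggest_ordering(domains, threshold=0.2):
    """Return (ordering, gap). ASSUMPTION: threshold is an illustrative value."""
    scores = [modal_relation(pairs) for pairs in domains.values()]
    gap = max(scores) - min(scores)  # how much the relation varies by domain
    return ("D2M" if gap <= threshold else "M2D"), gap


# Toy features, invented for illustration only.
stable = {
    "D0": [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])],
    "D1": [([1.0, 0.0], [1.0, 0.1]), ([0.0, 1.0], [0.1, 1.0])],
}
shifted = {
    "D0": [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])],
    "D1": [([1.0, 0.0], [0.0, 1.0]), ([0.0, 1.0], [1.0, 0.0])],
}
print(suggest_ordering(stable))
print(suggest_ordering(shifted))
```

A small gap means the cross-modal relation looks similar in every domain, which is the condition under which the paper reports D2M doing well; a large gap points toward M2D. Whether such a proxy tracks the paper's findings would need to be checked empirically.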

Load-bearing premise

The two selected tasks and the specific choice of one MML configuration paired with five DG techniques are representative enough to support general claims about framework superiority.

What would settle it

Running the same ten baselines on a new task or additional unseen domains and finding they no longer outperform state-of-the-art methods, or that the three reported insights on gains, ordering, and backbones fail to hold.

Figures

Figures reproduced from arXiv: 2606.00891 by Da Li, Qianshan Zhan, Qian Wang, Xiao-Jun Zeng, Xiatian Zhu.

**Figure 1.** Two MMDG frameworks. Modal-specific encoders Ev and Ea extract features from domains D0 and D1, with partially frozen backbones and trainable upper layers. MML aligns modalities via modality translation (Lmt) and modality contrastive loss (Lmc) before bottleneck fusion, while GBlend outputs modality-wise loss fusion weights wT. DG uses ERM (LERM) with optional regularizers: Mixup (Lmix), MMD (Lmmd), con…
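To make the MMD regularizer named in the Figure 1 caption concrete, here is a minimal sketch of a squared MMD between feature sets from two domains, added to an ERM loss with a weight. The RBF kernel, the biased estimator, the `gamma` value, and the helper names are all assumptions; the caption does not say which MMD variant the paper uses.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2). Kernel choice is assumed."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two feature sets."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (len(xs) * len(xs))
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (len(ys) * len(ys))
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy


def dg_objective(l_erm, l_mmd, lam_mmd=0.1):
    """ERM loss plus a weighted MMD penalty (lam_mmd is a hypothetical default)."""
    return l_erm + lam_mmd * l_mmd


# Toy features standing in for encoder outputs on domains D0 and D1.
feats_d0 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
feats_d1 = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0]]
print(mmd2(feats_d0, feats_d0))  # identical sets give zero
print(mmd2(feats_d0, feats_d1))  # shifted set gives a positive value
```

The penalty is zero when the two domains produce the same feature distribution and grows as they drift apart, which is why it can serve as a domain-alignment regularizer.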

**Figure 2.** MM component ablation and optimizer comparison. Bars report the mean…

**Figure 3.** Effect of unfreezing layers (L) for MM+ERM on HAC under D2M/M2D.

…performance changes smoothly and remains competitive across a broad mid-range, indicating that the conclusions are not dependent on precise tuning. Specifically, λmmd and λdc are stable over 0.01–0.5, while overly large values (especially 1.0) may cause over-regularization. λmix shows a clearer optimum around 0.1, whereas λmiro is more sensitive …

**Figure 4.** Sensitivity of DG regularization weights
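The captions above mention a Mixup term and a weight λmix on it. The sketch below shows standard Mixup (Zhang et al.) on a single feature pair and how a regularization weight enters the total loss. Note that the Beta-sampled mixing coefficient `t` is distinct from the loss weight `lam_mix`. The `alpha` value, the placeholder ERM loss, and the helper names are assumptions, and the probabilities stand in for a real model's output.

```python
import math
import random


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label."""
    return -math.log(probs[label])


def mixup_pair(x_i, x_j, alpha=0.2, rng=random):
    """Convex combination of two inputs with t ~ Beta(alpha, alpha)."""
    t = rng.betavariate(alpha, alpha)
    mixed = [t * a + (1.0 - t) * b for a, b in zip(x_i, x_j)]
    return mixed, t


def mixup_loss(probs, y_i, y_j, t):
    """Mixup loss: the same convex combination applied to the two label losses."""
    return t * cross_entropy(probs, y_i) + (1.0 - t) * cross_entropy(probs, y_j)


def total_loss(l_erm, l_mix, lam_mix):
    """ERM loss plus the weighted Mixup term."""
    return l_erm + lam_mix * l_mix


rng = random.Random(0)
mixed, t = mixup_pair([0.0, 0.0], [1.0, 1.0], rng=rng)
probs = [0.5, 0.5]  # stand-in for a model's softmax output on `mixed`
l_mix = mixup_loss(probs, 0, 1, t)
l_erm = 0.7  # placeholder value, not a reported result
for lam_mix in (0.01, 0.1, 0.5, 1.0):  # spans the range named in the captions
    print(lam_mix, total_loss(l_erm, l_mix, lam_mix))
```

A sensitivity study of the kind Figure 4 reports would rerun training at each weight and compare held-out accuracy; the loop here only shows where the weight enters the objective.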

Original abstract

Multi-modal Domain Generalization (MMDG) seeks to leverage complementary modalities to enhance model robustness on unseen domains. Despite extensive progress in Multi-modal Learning (MML) and Domain Generalization (DG) as individual fields, their systematic integration remains under-explored. Current MMDG research is largely confined to action recognition and lacks standardized evaluation protocols. To address this, we introduce MMDG-Bench, a comprehensive benchmark featuring two foundational frameworks: DG then MML (D2M) and MML then DG (M2D). We provide unified experimental protocols across diverse tasks, including video-audio-flow action recognition and RGB-Depth-IR face anti-spoofing. By instantiating ten MMDG baselines through pairing a unified MML configuration with five DG techniques under both D2M and M2D orderings, we demonstrate that these structured combinations frequently outperform existing state-of-the-art methods, underscoring the necessity of a unified benchmarking effort. Our analysis yields three key insights: (1) Integrating DG techniques provides consistent generalization gains across various backbones, whereas non-DG methods are highly sensitive to backbone shifts; (2) The optimal framework choice depends on inter-modal stability: D2M excels when modal relations are stable across domains, while M2D is more robust to cross-domain relational variance; (3) Stronger backbones yield amplified performance dividends when integrated into our structured frameworks. MMDG-Bench provides a principled foundation and actionable design guidelines for future research in multi-modal robustness. Code is released at https://github.com/qszhan/MMDG-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper fills a gap by releasing the first MMDG benchmark with D2M/M2D protocols and code, but its claims of framework superiority rest on only two task families.

The paper introduces MMDG-Bench along with the D2M and M2D orderings, then instantiates ten baselines by pairing one MML setup with five DG methods. It also ships code and reports three concrete observations on backbone sensitivity, framework choice under modal stability, and performance scaling. That combination is new relative to the separate MML and DG literatures and gives the area a shared starting point.

The work is strongest on the practical side: unified protocols across the two tasks, released code, and an explicit comparison structure. Those elements make it easier for others to run controlled experiments.

The soft spot is scope. The results and the three insights come from video-audio-flow action recognition plus RGB-Depth-IR face anti-spoofing. The stress-test concern about limited task diversity holds up; nothing in the abstract shows that the observed gains or the stability-versus-variance rule generalize to other multimodal shifts such as medical or robotic settings. Without broader coverage or statistical detail on the splits, the claim that the structured combinations "frequently outperform" existing SOTA stays provisional.

This is for researchers who need a common testbed for multimodal robustness work. A reader already active in DG or MML will find the baselines and ordering discussion useful even if they later expand the tasks. The paper shows clear thinking about how to combine the two fields and engages the literature honestly.

I would send it to peer review. A benchmark paper does not need perfect results on day one; it needs reproducible protocols and a clear statement of current limits, both of which are present here.

Referee Report

1 major / 0 minor

Summary. The paper introduces MMDG-Bench, a benchmark for multi-modal domain generalization (MMDG) that defines two frameworks (D2M: DG then MML; M2D: MML then DG). It instantiates ten baselines by pairing one unified MML configuration with five DG techniques under both orderings, evaluates them on video-audio-flow action recognition and RGB-Depth-IR face anti-spoofing, and reports that these baselines frequently outperform existing SOTA methods. The work also derives three insights: DG integration yields consistent gains independent of backbone; optimal framework depends on inter-modal stability vs. relational variance; and stronger backbones amplify gains within the frameworks. Code is released.

Significance. If the empirical claims hold under broader validation, the benchmark could help standardize evaluation in an under-explored intersection of MML and DG, and the released code supports reproducibility. The structured instantiation of baselines and the three insights offer actionable guidelines, though their scope is constrained by the evaluated tasks.

major comments (1)

[Abstract] The central claim that the ten structured MMDG baselines 'frequently outperform existing state-of-the-art methods' and that the three reported insights are generally valid rests on results from only two task families (video-audio-flow action recognition and RGB-Depth-IR face anti-spoofing). This limited diversity is load-bearing for the assertion that a unified benchmarking effort is necessary and that the insights generalize across multimodal domain shifts; the manuscript provides no additional tasks or cross-category validation to support broader applicability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive comment on evaluation scope. We address the major comment below with a commitment to textual revisions that accurately reflect the manuscript's empirical basis.

Referee: [Abstract] The central claim that the ten structured MMDG baselines 'frequently outperform existing state-of-the-art methods' and that the three reported insights are generally valid rests on results from only two task families (video-audio-flow action recognition and RGB-Depth-IR face anti-spoofing). This limited diversity is load-bearing for the assertion that a unified benchmarking effort is necessary and that the insights generalize across multimodal domain shifts; the manuscript provides no additional tasks or cross-category validation to support broader applicability.

Authors: We agree that the empirical results and derived insights are based on two task families. These were deliberately chosen to span distinct modality sets (video-audio-flow; RGB-Depth-IR) and domain-shift regimes (cross-dataset action recognition; cross-device spoofing), which are the primary settings explored in prior MMDG literature. Nevertheless, the limited number of categories means the claims of frequent outperformance and general insight validity should be scoped to the evaluated benchmarks. We will revise the abstract to replace the unqualified phrasing with 'frequently outperform existing state-of-the-art methods on the evaluated tasks' and similarly qualify the three insights as holding under the tested conditions. A new Limitations paragraph will be added to the discussion section explicitly noting the current task coverage and encouraging extensions to additional multimodal categories (e.g., vision-language or audio-text). These changes require only textual edits and do not alter the experimental results or code release.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark paper with no derivation chain

The paper defines MMDG-Bench as an empirical evaluation framework, instantiates ten baselines via explicit pairings of one MML config with five DG methods under D2M/M2D orderings, and reports performance on two task families. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation chains appear in the provided text. All claims reduce directly to the described experimental protocol and released code rather than to any input by construction. This is a standard benchmark contribution whose central results are externally falsifiable via the public repository.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution consists of benchmark definition, task selection, and empirical protocol design.

[1] [1]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Carlucci, F.M., D’Innocente, A., Bucci, S., Caputo, B., Tommasi, T.: Domain gen- eralization by solving jigsaw puzzles. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2229–2238 (2019)

2019

[2] [2]

In: European Conference on Computer Vision

Cha, J., Lee, K., Park, S., Chun, S.: Domain generalization by mutual-information regularization with pre-trained models. In: European Conference on Computer Vision. pp. 440–457. Springer (2022)

2022

[3] [3]

In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing

Chen, H., Xie, W., Vedaldi, A., Zisserman, A.: Vggsound: A large-scale audio- visual dataset. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 721–725. IEEE (2020)

2020

[4] [4]

Contributors, M.: Openmmlab’s next generation video understanding toolbox and benchmark.https://github.com/open-mmlab/mmaction2(2020)

2020

[5] [5]

Advances in Neural Information Processing Systems36, 78674–78695 (2023)

Dong, H., Nejjar, I., Sun, H., Chatzi, E., Fink, O.: Simmmdg: A simple and effective framework for multi-modal domain generalization. Advances in Neural Information Processing Systems36, 78674–78695 (2023)

2023

[6] [6]

Advances in Neural Information Processing Systems37, 66773–66795 (2024)

Fan, Y., Xu, W., Wang, H., Guo, S.: Cross-modal representation flattening for multi-modal domain generalization. Advances in Neural Information Processing Systems37, 66773–66795 (2024)

2024

[7] [7]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recog- nition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6202–6211 (2019)

2019

[8] [8]

In: International Conference on Learning Representations (2021)

Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B.: Sharpness-aware minimization for efficiently improving generalization. In: International Conference on Learning Representations (2021)

2021

[9] [9]

IEEE Transactions on Information Forensics and Security15, 42–55 (2019)

George, A., Mostaani, Z., Geissenbuhler, D., Nikisins, O., Anjos, A., Marcel, S.: Biometric face presentation attack detection with multi-channel convolutional neu- ral network. IEEE Transactions on Information Forensics and Security15, 42–55 (2019)

2019

[10] [10]

In: Proc

Gong, Y., Chung, Y.A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021. pp. 571–575 (2021).https://doi.org/10.21437/Interspeech. 2021-698

work page doi:10.21437/interspeech 2021

[11] [11]

In: International Conference on Machine Learning

Gower, R.M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., Richt´ arik, P.: Sgd: General analysis and improved rates. In: International Conference on Machine Learning. pp. 5200–5209 (2019)

2019

[12] [12]

Advances in Neural Information Processing Systems 19(2006)

Gretton, A., Borgwardt, K., Rasch, M., Sch¨ olkopf, B., Smola, A.: A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems 19(2006)

2006

[13] [13]

In: Interna- tional Conference on Learning Representations (2021)

Gulrajani, I., Lopez-Paz, D.: In search of lost domain generalization. In: Interna- tional Conference on Learning Representations (2021)

2021

[14] [14]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)

2016

[15] [15]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025)

Huang, H., Xia, Y., Zhou, S., Wang, H., Wang, S., Zhao, Z.: Bridging domain generalization to multimodal domain generalization via unified representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2025)

2025

[16] [16]

In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

Ji, H., Lee, J., Park, E.: Alignment and distillation: A robust framework for mul- timodal domain generalizable human action recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 6913– 6924 (2026) MMDG-Bench 17

2026

[17] [17]

In: International Conference on Machine Learning

Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning. pp. 4904–4916. PMLR (2021)

2021

[18] [18]

The Kinetics Human Action Video Dataset

Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[19] [19]

Psychological Bulletin85(2), 410 (1978)

Knapp, T.R.: Canonical correlation analysis: A general parametric significance- testing system. Psychological Bulletin85(2), 410 (1978)

1978

[20] [20]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Li, D., Yang, Y., Song, Y.Z., Hospedales, T.: Learning to generalize: Meta-learning for domain generalization. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32 (2018)

2018

[21] Li, D., Yang, Y., Song, Y.Z., Hospedales, T.M.: Deeper, broader and artier domain generalization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5542–5550 (2017)

[22] Li, H., Wan, H., Zhang, L., Jiu, M., Li, S., Xu, M., Khan, M.H.: Towards robust multimodal domain generalization via modality-domain joint adversarial training. In: ACM International Conference on Multimedia. pp. 180–188 (2025)

[23] Lin, X., Wang, S., Cai, R., Liu, Y., Fu, Y., Tang, W., Yu, Z., Kot, A.: Suppress and rebalance: Towards generalized multi-modal face anti-spoofing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 211–221 (2024)

[24] Liu, A., Tan, Z., Wan, J., Escalera, S., Guo, G., Li, S.Z.: CASIA-SURF CeFA: A benchmark for multi-modal cross-ethnicity face anti-spoofing. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1179–1187 (2021)

[25] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019)

[26] Loshchilov, I., Hutter, F., et al.: Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101 (2017)

[27] Motiian, S., Piccirilli, M., Adjeroh, D.A., Doretto, G.: Unified deep supervised domain adaptation and generalization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5715–5725 (2017)

[28] Munro, J., Damen, D.: Multi-modal domain adaptation for fine-grained action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 122–132 (2020)

[29] Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C., Sun, C.: Attention bottlenecks for multimodal fusion. Advances in Neural Information Processing Systems 34, 14200–14213 (2021)

[30] Owens, A., Efros, A.A.: Audio-visual scene analysis with self-supervised multisensory features. In: Proceedings of the European Conference on Computer Vision. pp. 631–648 (2018)

[31] Planamente, M., Plizzari, C., Alberti, E., Caputo, B.: Domain generalization through audio-visual relative norm alignment in first person action recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1807–1818 (2022)

[32] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763 (2021)

[33] Vapnik, V.N.: An overview of statistical learning theory. IEEE Transactions on Neural Networks 10(5), 988–999 (1999)

[34] Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., Lu, W., Chen, Y., Zeng, W., Yu, P.S.: Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering 35(8), 8052–8072 (2022)

[35] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y.: VideoMAE V2: Scaling video masked autoencoders with dual masking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14549–14560 (2023)

[36] Wang, W., Tran, D., Feiszli, M.: What makes training multi-modal classification networks hard? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12695–12705 (2020)

[37] Wang, X., Cheng, Z., Zhong, T., Chen, L., Zhou, F.: Modality-balanced collaborative distillation for multi-modal domain generalization. In: Proceedings of the AAAI Conference on Artificial Intelligence (2026)

[38] Xu, P., Zhu, X., Clifton, D.A.: Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(10), 12113–12132 (2023)

[39] Yao, Y., Mihalcea, R.: Modality-specific learning rates for effective multimodal additive late-fusion. In: Findings of the Association for Computational Linguistics: ACL 2022. pp. 1824–1834 (2022)

[40] Yu, Z., Liu, A., Zhao, C., Cheng, K.H., Cheng, X., Zhao, G.: Flexible-modal face anti-spoofing: A benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6346–6351 (2023)

[41] Zar, J.H.: Significance testing of the Spearman rank correlation coefficient. Journal of the American Statistical Association 67(339), 578–580 (1972)

[42] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: Mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations (2018)

[43] Zhang, S., Liu, A., Wan, J., Liang, Y., Guo, G., Escalera, S., Escalante, H.J., Li, S.Z.: CASIA-SURF: A large-scale multi-modal benchmark for face anti-spoofing. IEEE Transactions on Biometrics, Behavior, and Identity Science 2(2), 182–193 (2020)