MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization
Pith reviewed 2026-05-10 14:49 UTC · model grok-4.3
The pith
Maximizing the entropy of each modality's features stops encoders from overfitting to source-specific cross-modal patterns, helping them generalize to new domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Joint end-to-end training of multimodal encoders and fusion modules leads encoders to exploit cross-modal co-occurrences that arise from source-specific recording conditions instead of learning domain-invariant features, a failure mode termed Fusion Overfitting. MER-DG addresses this by maximizing the entropy of each encoder's individual feature distribution, thereby preserving feature diversity and producing more robust representations that generalize to unseen domains.
What carries the argument
Modality-Entropy Regularization (MER), an additive loss term applied separately to each modality encoder that maximizes the entropy of its feature distribution to preserve diversity.
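Where the loss plugs in is easiest to see in code. The paper does not specify its entropy estimator (the first minor comment below flags this), so the following is a minimal PyTorch sketch assuming a Gaussian log-determinant proxy for differential entropy; the function names and the weight `lam` are illustrative, not the authors' implementation.

```python
import torch

def gaussian_entropy_proxy(z: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Differentiable entropy surrogate for a feature batch z of shape (batch, dim).

    Under a Gaussian assumption H(z) = 0.5 * logdet(2*pi*e * Cov(z)), so
    maximizing logdet(Cov(z)) maximizes the entropy proxy up to constants.
    """
    z = z - z.mean(dim=0, keepdim=True)                       # center the batch
    cov = (z.T @ z) / (z.shape[0] - 1)                        # sample covariance
    cov = cov + eps * torch.eye(z.shape[1], device=z.device)  # numerical floor
    return 0.5 * torch.logdet(cov)

def mer_objective(task_loss: torch.Tensor, modality_features, lam: float = 0.1):
    """Total loss: task loss minus lam * summed per-modality entropy proxies.

    Subtracting the (to-be-maximized) entropy term turns it into an additive
    regularizer on the minimized objective, one term per modality encoder.
    """
    entropy = sum(gaussian_entropy_proxy(z) for z in modality_features)
    return task_loss - lam * entropy
```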
If this is right
- Multimodal systems generalize better to unseen recording conditions without changing their encoder or fusion architecture.
- Each modality's features remain more diverse, limiting dependence on training-specific cross-modal statistics.
- The regularization adds to any existing multimodal pipeline as a single extra loss term.
- Average accuracy rises by about 5 percent over standard fusion and 2 percent over prior state-of-the-art on the tested benchmarks.
Where Pith is reading between the lines
- The same entropy-maximization idea could be tested in single-modality settings by treating intra-feature dependencies as analogous to cross-modal co-occurrences.
- Datasets that explicitly control the strength of cross-modal correlations would allow direct measurement of whether higher entropy disrupts those correlations.
- Applying MER-DG together with adversarial domain-invariance losses might produce larger gains than either technique alone.
Load-bearing premise
That increasing the entropy of each modality's features will specifically reduce reliance on cross-modal co-occurrences rather than simply adding noise or discarding useful signals.
What would settle it
Train and test on a dataset in which cross-modal co-occurrences are deliberately randomized or removed while domain shifts are preserved, then check whether MER-DG loses its reported gains over standard fusion.
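A minimal sketch of that control, assuming precomputed feature arrays as numpy inputs; the interface is hypothetical. Re-pairing one modality within each class preserves label-conditional statistics while destroying the instance-level co-occurrences MER-DG is claimed to counteract.

```python
import numpy as np

def break_cooccurrence(feats_a, feats_b, labels, seed=0):
    """Shuffle modality-B samples within each class.

    Each modality's label-conditional statistics are preserved, but the
    instance-level pairing that carries cross-modal co-occurrence is
    destroyed. If MER-DG's gains over standard fusion vanish under this
    shuffle, the co-occurrence mechanism is implicated; if they persist,
    the gains look like generic regularization.
    """
    rng = np.random.default_rng(seed)
    feats_b = feats_b.copy()
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        feats_b[idx] = feats_b[rng.permutation(idx)]
    return feats_a, feats_b
```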
original abstract
Deploying multimodal models in real-world scenarios requires generalization to new environments where recording conditions differ from training, a challenge known as multimodal domain generalization (MMDG). Standard architectures employ separate encoders for each modality and a fusion module, training the system end-to-end by optimizing on the fused features. In this paper, we identify that such joint optimization causes encoders to exploit cross-modal co-occurrences, statistical relationships between modalities that arise from source-specific recording conditions, rather than learning domain-invariant features. We term this failure mode Fusion Overfitting. To address this, we propose Modality-Entropy Regularization for Domain Generalization (MER-DG), which maximizes the entropy of each encoder's feature distribution to preserve feature diversity. MER-DG is architecture-agnostic and integrates into existing multimodal frameworks as an additive loss term. Extensive experiments on EPIC-Kitchens and HAC benchmarks demonstrate average improvements of approximately 5% over standard fusion and approximately 2% over state-of-the-art methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies a failure mode called 'Fusion Overfitting' in multimodal domain generalization, where end-to-end joint optimization of modality encoders and a fusion module causes the encoders to exploit source-specific cross-modal co-occurrences rather than learning domain-invariant features. It proposes MER-DG, an architecture-agnostic additive loss term that maximizes the entropy of each encoder's individual feature distribution to preserve diversity and mitigate this issue. Experiments on EPIC-Kitchens and HAC benchmarks report average gains of ~5% over standard fusion and ~2% over prior SOTA methods.
Significance. If the proposed entropy regularization demonstrably reduces cross-modal dependence rather than acting as generic regularization, MER-DG would offer a lightweight, plug-in improvement for multimodal generalization with broad applicability to existing fusion architectures. The architecture-agnostic design and reported gains on public benchmarks are strengths, though the absence of direct mechanistic evidence (e.g., modality dependence metrics) limits the strength of the central causal claim.
major comments (3)
- §3 (Method): The central claim that maximizing per-modality feature entropy counters Fusion Overfitting by reducing exploitation of cross-modal co-occurrences is not supported by any direct measurement (e.g., mutual information, canonical correlation, or co-occurrence statistics between modality features pre- and post-regularization; see the dependence-measurement sketch after this list). Without this, accuracy gains on EPIC-Kitchens and HAC could arise from unrelated regularization effects.
- §4 (Experiments): No ablation studies or controls are reported for the entropy regularization weight, nor are statistical significance tests (e.g., paired t-tests across runs) provided for the ~5% and ~2% gains. This undermines verification that the entropy term specifically addresses the identified failure mode rather than improving optimization generally.
- §2 (Related Work) and §3: The definition of Fusion Overfitting relies on the assumption that joint optimization necessarily exploits source-specific cross-modal statistics, but no formalization or toy example is given to distinguish it from standard domain shift or overfitting; the entropy term's effect on this specific mechanism therefore remains unverified.
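On the first major comment: one concrete dependence measurement is linear CKA (Kornblith et al., 2019) between per-modality feature matrices, computed with and without MER; a drop under MER would support the claimed mechanism. This is a sketch of the standard estimator, not anything taken from the paper.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between feature matrices x (n, d1) and y (n, d2).

    Returns a value in [0, 1]; higher means more shared (linear) structure
    between the two modalities' representations of the same n samples.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(y.T @ x, ord="fro") ** 2
    return hsic / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))
```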
minor comments (2)
- Abstract and §3: The abstract and method description omit implementation details such as the exact entropy estimator used (e.g., histogram vs. kernel density) and how features are normalized before entropy computation (see the estimator sketch after this list).
- §4: Figure captions and table legends should explicitly state the number of runs and random seeds for reported means and standard deviations to improve reproducibility.
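On the estimator question in the first minor comment: besides parametric proxies, a standard nonparametric choice is the Kozachenko-Leonenko k-nearest-neighbor estimator, sketched below for analysis purposes (it is not differentiable, so it suits evaluation rather than training). Whether the paper uses anything like it is not stated.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x: np.ndarray, k: int = 3) -> float:
    """Kozachenko-Leonenko kNN estimate of differential entropy (in nats).

    x: samples of shape (n, d). Uses the distance from each point to its
    k-th nearest neighbor; log_vd is the log-volume of the unit d-ball.
    """
    n, d = x.shape
    dists = cKDTree(x).query(x, k=k + 1)[0][:, k]  # k-th NN, excluding self
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(dists + 1e-12))
```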
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and commit to revisions that strengthen the evidence and clarity of our claims.
point-by-point responses
- Referee: §3 (Method): The central claim that maximizing per-modality feature entropy counters Fusion Overfitting by reducing exploitation of cross-modal co-occurrences is not supported by any direct measurement (e.g., mutual information, canonical correlation, or co-occurrence statistics between modality features pre- and post-regularization). Without this, accuracy gains on EPIC-Kitchens and HAC could arise from unrelated regularization effects.
Authors: We agree that direct mechanistic measurements would provide stronger support for the proposed causal mechanism. The original submission relied primarily on end-task performance gains as evidence. In the revision, we will add an analysis (new subsection in §3 and Appendix) computing canonical correlation between modality feature pairs on both benchmarks, showing that MER-DG reduces cross-modal dependence relative to the unregularized baseline. This addresses the concern that gains may stem from generic regularization. revision: yes
- Referee: §4 (Experiments): No ablation studies or controls are reported for the entropy regularization weight, nor are statistical significance tests (e.g., paired t-tests across runs) provided for the ~5% and ~2% gains. This undermines verification that the entropy term specifically addresses the identified failure mode rather than improving optimization generally.
Authors: We acknowledge these omissions in the initial version. The revised manuscript will include a full ablation on the entropy weight λ (showing performance across a range of values with a clear optimum) and will report mean and standard deviation over five random seeds for all methods. We will also add paired t-test results comparing MER-DG against the strongest baselines to confirm statistical significance of the reported improvements (a test sketch follows these responses). revision: yes
- Referee: §2 (Related Work) and §3: The definition of Fusion Overfitting relies on the assumption that joint optimization necessarily exploits source-specific cross-modal statistics, but no formalization or toy example is given to distinguish it from standard domain shift or overfitting; the entropy term's effect on this specific mechanism therefore remains unverified.
Authors: Section 2 motivates Fusion Overfitting by contrasting it with unimodal domain shift: the failure arises specifically from the interaction of modality encoders under joint optimization on source data. While this intuition is provided, we accept that a concrete illustration would improve verifiability. The revision will add a controlled synthetic experiment (new paragraph in §3) with a toy multimodal dataset where cross-modal co-occurrences are explicitly source-dependent, showing that standard fusion overfits while MER-DG preserves domain-invariant features (a data-generation sketch follows). revision: partial
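A minimal sketch of the promised paired significance test, assuming per-seed accuracies are available; the numbers below are hypothetical placeholders, not the paper's results.

```python
from scipy.stats import ttest_rel

# Hypothetical per-seed top-1 accuracies over five matched seeds; the
# paper's actual per-run numbers are not available here.
mer_dg   = [64.1, 63.8, 64.5, 63.9, 64.2]
baseline = [62.0, 61.7, 62.4, 61.9, 62.1]

t_stat, p_value = ttest_rel(mer_dg, baseline)  # paired across matched seeds
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```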
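And a sketch of the kind of toy dataset the third response promises, under assumed parameters: the class signal is domain-invariant, while a nuisance latent is shared across modalities only in the source domain, giving a fusion model a shortcut that breaks at test time.

```python
import numpy as np

def toy_mmdg(n=1000, domain="source", seed=0):
    """Two-modality toy data with source-dependent cross-modal co-occurrence.

    Column 0 of each modality carries the (domain-invariant) class signal;
    column 1 is a nuisance latent that co-occurs across modalities only in
    the source domain.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, n)
    signal = (2 * y - 1).astype(float).reshape(-1, 1)
    nuisance = rng.normal(size=(n, 1))
    mod_a = np.hstack([signal + 0.5 * rng.normal(size=(n, 1)), nuisance])
    # In the source domain the same nuisance appears in both modalities;
    # in the target domain it is decorrelated across them.
    shared = nuisance if domain == "source" else rng.normal(size=(n, 1))
    mod_b = np.hstack([signal + 0.5 * rng.normal(size=(n, 1)), shared])
    return mod_a, mod_b, y
```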
Circularity Check
No circularity: proposal is an independent regularization term validated empirically
full rationale
The paper defines Fusion Overfitting conceptually and introduces MER-DG as a new additive entropy-maximization loss, without equations that reduce the claimed benefit to a fitted parameter, a self-citation chain, or an ansatz smuggled from prior work. The central claim rests on experimental gains on EPIC-Kitchens and HAC rather than on a derivation that is tautological by construction. This is the common, honest case of a method paper whose validity rests on evidence external to its own inputs.