pith. machine review for the scientific record.

arxiv: 2605.01967 · v1 · submitted 2026-05-03 · 💻 cs.LG · cs.CV

Recognition: unknown

MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 14:49 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords multimodal domain generalization · fusion overfitting · modality entropy regularization · feature diversity · cross-modal co-occurrences · domain-invariant features · multimodal fusion

The pith

Maximizing the entropy of each modality's features stops encoders from overfitting to cross-modal patterns that break down in new domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard multimodal models jointly optimize separate encoders with a fusion module, which leads the encoders to latch onto statistical relationships between modalities that exist only because of the training environments' recording conditions. The paper names this failure mode Fusion Overfitting and shows it prevents learning features that remain useful when conditions change. To counter it, MER-DG adds a regularization term that pushes each encoder's feature distribution toward higher entropy, keeping the features diverse and less dependent on those spurious links. The term plugs into existing architectures as an extra loss without redesigning the network. Tests on two video benchmarks, EPIC-Kitchens and HAC, show roughly 5 percent average gains over ordinary fusion and 2 percent over other recent methods.

Core claim

Joint end-to-end training of multimodal encoders and fusion modules leads encoders to exploit cross-modal co-occurrences that arise from source-specific recording conditions instead of learning domain-invariant features, a failure mode termed Fusion Overfitting. MER-DG addresses this by maximizing the entropy of each encoder's individual feature distribution, thereby preserving feature diversity and producing more robust representations that generalize to unseen domains.

What carries the argument

Modality-Entropy Regularization (MER), an additive loss term applied separately to each modality encoder that maximizes the entropy of its feature distribution to preserve diversity.
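The page does not specify which entropy estimator MER uses, so the following is only a sketch: one common differentiable proxy treats a feature batch as Gaussian and scores its entropy by the log-determinant of the feature covariance, which a MER-style loss would subtract (with some weight λ) from the task objective. The function name and the proxy choice are illustrative, not taken from the paper.

```python
import numpy as np

def gaussian_entropy_proxy(feats, eps=1e-4):
    """Entropy proxy for a feature batch of shape (n, d):
    0.5 * logdet(cov + eps*I), i.e. Gaussian differential entropy
    up to additive constants. Higher means more diverse features."""
    cov = np.cov(feats, rowvar=False)
    _, logdet = np.linalg.slogdet(cov + eps * np.eye(cov.shape[0]))
    return 0.5 * logdet

rng = np.random.default_rng(0)
diverse = rng.normal(size=(256, 16))                              # isotropic features
collapsed = rng.normal(size=(256, 2)) @ rng.normal(size=(2, 16))  # rank-2 features

h_diverse = gaussian_entropy_proxy(diverse)
h_collapsed = gaussian_entropy_proxy(collapsed)
```

In a real pipeline this term would be computed per modality on the encoder outputs and maximized alongside the fusion objective; the collapsed batch above scores far lower, mimicking the low-rank representations the paper attributes to Fusion Overfitting.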

If this is right

  • Multimodal systems generalize better to unseen recording conditions without changing their encoder or fusion architecture.
  • Each modality's features remain more diverse, limiting dependence on training-specific cross-modal statistics.
  • The regularization adds to any existing multimodal pipeline as a single extra loss term.
  • Average accuracy rises by about 5 percent over basic fusion and 2 percent over prior state-of-the-art on the tested benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same entropy-maximization idea could be tested in single-modality settings by treating intra-feature dependencies as analogous to cross-modal co-occurrences.
  • Datasets that explicitly control the strength of cross-modal correlations would allow direct measurement of whether higher entropy disrupts those correlations.
  • Applying MER-DG together with adversarial domain-invariance losses might produce larger gains than either technique alone.

Load-bearing premise

That increasing the entropy of each modality's features will specifically reduce reliance on cross-modal co-occurrences rather than simply adding noise or discarding useful signals.

What would settle it

Train and test on a dataset in which cross-modal co-occurrences are deliberately randomized or removed while domain shifts are preserved, then check whether MER-DG loses its reported gains over standard fusion.
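A minimal sketch of that control, under the assumption that co-occurrences can be broken by shuffling one modality within each class, so labels and per-modality marginals survive while sample-level pairings do not. All data here is synthetic and the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 8
labels = rng.integers(0, 5, size=n)
video = rng.normal(size=(n, d))
audio = video + 0.1 * rng.normal(size=(n, d))  # strong cross-modal co-occurrence

def decorrelate_within_class(feats, labels, rng):
    """Shuffle one modality within each class: labels and marginal
    distributions survive, sample-level cross-modal pairings do not."""
    out = feats.copy()
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        out[idx] = feats[rng.permutation(idx)]
    return out

audio_shuf = decorrelate_within_class(audio, labels, rng)
corr_before = np.mean([np.corrcoef(video[:, j], audio[:, j])[0, 1] for j in range(d)])
corr_after = np.mean([np.corrcoef(video[:, j], audio_shuf[:, j])[0, 1] for j in range(d)])
```

If MER-DG's gains over standard fusion vanish on data treated this way, the entropy term was doing something other than disrupting cross-modal co-occurrences.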

Figures

Figures reproduced from arXiv: 2605.01967 by Ghassan AlRegib, Yavuz Yarici.

Figure 1
Figure 1: Fusion Overfitting in multimodal domain generalization. (a) During source domain training, the fusion objective causes encoders to weight their representations toward source-specific cross-modal features (purple) that complement each other, while domain-invariant features (green) remain underutilized. (b) In source domain testing, the learned cross-modal features align correctly, enabling accurate predicti… view at source ↗
Figure 2
Figure 2: Spectral analysis of encoder representations on the EPIC-Kitchens target domain. We plot log-normalized singular values for Video (Left) and Audio (Right) encoders, with RankMe scores in the legend. The Fusion Baseline (red dashed) exhibits rapid singular value decay and lower effective rank, indicating that fusion training compresses representations into low-rank subspaces. MER-DG (blue solid) counteract… view at source ↗
Figure 3
Figure 3: Effect of fusion training on encoder performance. Encoders trained within the fusion framework show degraded standalone performance compared to their independently trained counterparts. The fused model fails to outperform the independent Video encoder. The Video encoder drops from 59.46% (independent) to 56.53% (fusion-trained), while the Audio encod… view at source ↗
Figure 4
Figure 4: Parameter Sensitivity. Impact of λ (Left), αmarg (Middle), and αspec (Right) on EPIC-Kitchens. Dashed line: baseline. Performance holds across the entire range of values tested for all three hyperparameters, indicating that MER-DG is not overly sensitive to precise hyperparameter tuning, provided the weights are within a reasonable range… view at source ↗
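The RankMe scores reported in Figure 2 follow, as far as this page shows, the standard definition: the exponential of the entropy of the normalized singular values of the feature matrix. A sketch on synthetic feature matrices:

```python
import numpy as np

def rankme(feats, eps=1e-7):
    """RankMe effective rank: exp of the entropy of normalized singular
    values. Ranges from ~1 (collapsed) to min(n, d) (isotropic)."""
    s = np.linalg.svd(feats, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-np.sum(p * np.log(p + eps))))

rng = np.random.default_rng(0)
full_rank = rng.normal(size=(512, 32))                            # healthy encoder
low_rank = rng.normal(size=(512, 3)) @ rng.normal(size=(3, 32))   # collapsed encoder

r_full = rankme(full_rank)
r_low = rankme(low_rank)
```

The low-rank matrix scores close to its true rank of 3, the pattern Figure 2 attributes to the fusion baseline; the isotropic one scores near the full dimensionality of 32.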
read the original abstract

Deploying multimodal models in real-world scenarios requires generalization to new environments where recording conditions differ from training, a challenge known as multimodal domain generalization (MMDG). Standard architectures employ separate encoders for each modality and a fusion module, training the system end-to-end by optimizing on the fused features. In this paper, we identify that such joint optimization causes encoders to exploit cross-modal co-occurrences, statistical relationships between modalities that arise from source-specific recording conditions, rather than learning domain-invariant features. We term this failure mode Fusion Overfitting. To address this, we propose Modality-Entropy Regularization for Domain Generalization (MER-DG), which maximizes the entropy of each encoder's feature distribution to preserve feature diversity. MER-DG is architecture-agnostic and integrates into existing multimodal frameworks as an additive loss term. Extensive experiments on EPIC-Kitchens and HAC benchmarks demonstrate average improvements of approximately 5% over standard fusion and approximately 2% over state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper identifies a failure mode called 'Fusion Overfitting' in multimodal domain generalization, where end-to-end joint optimization of modality encoders and a fusion module causes the encoders to exploit source-specific cross-modal co-occurrences rather than learning domain-invariant features. It proposes MER-DG, an architecture-agnostic additive loss term that maximizes the entropy of each encoder's individual feature distribution to preserve diversity and mitigate this issue. Experiments on EPIC-Kitchens and HAC benchmarks report average gains of ~5% over standard fusion and ~2% over prior SOTA methods.

Significance. If the proposed entropy regularization demonstrably reduces cross-modal dependence rather than acting as generic regularization, MER-DG would offer a lightweight, plug-in improvement for multimodal generalization with broad applicability to existing fusion architectures. The architecture-agnostic design and reported gains on public benchmarks are strengths, though the absence of direct mechanistic evidence (e.g., modality dependence metrics) limits the strength of the central causal claim.

major comments (3)
  1. [§3] §3 (Method): The central claim that maximizing per-modality feature entropy counters Fusion Overfitting by reducing exploitation of cross-modal co-occurrences is not supported by any direct measurement (e.g., mutual information, canonical correlation, or co-occurrence statistics between modality features pre- and post-regularization). Without this, accuracy gains on EPIC-Kitchens and HAC could arise from unrelated regularization effects.
  2. [§4] §4 (Experiments): No ablation studies or controls are reported for the entropy regularization weight, nor are statistical significance tests (e.g., paired t-tests across runs) provided for the ~5% and ~2% gains. This undermines verification that the entropy term specifically addresses the identified failure mode rather than improving optimization generally.
  3. [§2 and §3] §2 (Related Work) and §3: The definition of Fusion Overfitting relies on the assumption that joint optimization necessarily exploits source-specific cross-modal statistics, but no formalization or toy example is given to distinguish this from standard domain shift or overfitting; the entropy term's effect on this specific mechanism therefore remains unverified.
minor comments (2)
  1. [Abstract and §3] The abstract and method description omit implementation details such as the exact entropy estimator used (e.g., histogram vs. kernel density) and how features are normalized before entropy computation.
  2. [§4] Figure captions and table legends should explicitly state the number of runs and random seeds for reported means and standard deviations to improve reproducibility.
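Major comment 1 asks for a direct cross-modal dependence measurement; a minimal version computes the top canonical correlation between two modalities' feature matrices (orthonormalize each centered matrix, then take the largest singular value of their inner product). Everything below is illustrative synthetic data, not the paper's features.

```python
import numpy as np

def first_canonical_corr(X, Y):
    """Top canonical correlation between feature matrices (n, dx), (n, dy)."""
    def basis(A):
        # Orthonormal basis of the centered column space via SVD.
        return np.linalg.svd(A - A.mean(axis=0), full_matrices=False)[0]
    s = np.linalg.svd(basis(X).T @ basis(Y), compute_uv=False)
    return float(min(s[0], 1.0))  # clip numerical overshoot above 1

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 4))                    # shared latent factor
X = np.hstack([z, rng.normal(size=(500, 4))])    # modality A
Y = np.hstack([z, rng.normal(size=(500, 4))])    # modality B (co-occurring)
Y_ind = rng.normal(size=(500, 8))                # independent modality

c_shared = first_canonical_corr(X, Y)
c_ind = first_canonical_corr(X, Y_ind)
```

Reporting this statistic on encoder outputs with and without MER would directly test whether the entropy term lowers cross-modal dependence, rather than inferring it from accuracy alone.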

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and commit to revisions that strengthen the evidence and clarity of our claims.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The central claim that maximizing per-modality feature entropy counters Fusion Overfitting by reducing exploitation of cross-modal co-occurrences is not supported by any direct measurement (e.g., mutual information, canonical correlation, or co-occurrence statistics between modality features pre- and post-regularization). Without this, accuracy gains on EPIC-Kitchens and HAC could arise from unrelated regularization effects.

    Authors: We agree that direct mechanistic measurements would provide stronger support for the proposed causal mechanism. The original submission relied primarily on end-task performance gains as evidence. In the revision, we will add an analysis (new subsection in §3 and Appendix) computing canonical correlation between modality feature pairs on both benchmarks, showing that MER-DG reduces cross-modal dependence relative to the unregularized baseline. This addresses the concern that gains may stem from generic regularization. revision: yes

  2. Referee: [§4] §4 (Experiments): No ablation studies or controls are reported for the entropy regularization weight, nor are statistical significance tests (e.g., paired t-tests across runs) provided for the ~5% and ~2% gains. This undermines verification that the entropy term specifically addresses the identified failure mode rather than improving optimization generally.

    Authors: We acknowledge these omissions in the initial version. The revised manuscript will include a full ablation on the entropy weight λ (showing performance across a range of values with a clear optimum) and will report mean and standard deviation over five random seeds for all methods. We will also add paired t-test results comparing MER-DG against the strongest baselines to confirm statistical significance of the reported improvements. revision: yes

  3. Referee: [§2 and §3] §2 (Related Work) and §3: The definition of Fusion Overfitting relies on the assumption that joint optimization necessarily exploits source-specific cross-modal statistics, but no formalization or toy example is given to distinguish this from standard domain shift or overfitting; the entropy term's effect on this specific mechanism therefore remains unverified.

    Authors: Section 2 motivates Fusion Overfitting by contrasting it with unimodal domain shift: the failure arises specifically from the interaction of modality encoders under joint optimization on source data. While this intuition is provided, we accept that a concrete illustration would improve verifiability. The revision will add a controlled synthetic experiment (new paragraph in §3) with a toy multimodal dataset where cross-modal co-occurrences are explicitly source-dependent; we show that standard fusion overfits while MER-DG preserves domain-invariant features. revision: partial
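The paired t-test promised in response 2 needs nothing beyond per-seed scores; a sketch with made-up accuracies (not the paper's numbers), where |t| above the critical value of about 2.78 for 4 degrees of freedom indicates significance at the 0.05 level:

```python
import numpy as np

def paired_t(a, b):
    """Paired t statistic for per-seed scores a vs b (same seeds, same order)."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

# Hypothetical per-seed accuracies over five seeds (illustrative only).
mer_dg = [64.1, 63.7, 64.5, 63.9, 64.2]
fusion = [59.2, 58.8, 59.6, 59.0, 59.1]

t = paired_t(mer_dg, fusion)
```

Pairing by seed removes between-seed variance from the comparison, which is why this test is stricter than comparing two independent means.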

Circularity Check

0 steps flagged

No circularity: proposal is an independent regularization term validated empirically

full rationale

The paper defines Fusion Overfitting conceptually and introduces MER-DG as a new additive entropy-maximization loss without any equations that reduce the claimed benefit to a fitted parameter, self-citation chain, or ansatz smuggled from prior work. The central claim rests on experimental gains on EPIC-Kitchens and HAC rather than a derivation that is tautological by construction. This is the common honest case of a method paper whose validity is external to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical premise that entropy maximization in modality encoders yields domain-invariant features; no explicit free parameters, axioms, or invented entities are detailed in the abstract.

pith-pipeline@v0.9.0 · 5470 in / 1148 out tokens · 57984 ms · 2026-05-10T14:49:25.609668+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 13 canonical work pages · 3 internal anchors
