The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

2025 · cs.CV · arXiv 2511.21331

1 Pith paper cites this work. Polarity classification is still indexing.



abstract

Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework. We release our code and dataset at https://github.com/estafons/confu.
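The objective described in the abstract (pairwise contrastive terms plus a fused-modality term that aligns a fused pair with the third modality) might be sketched roughly as follows for three modalities. This is a minimal, dependency-free illustration and not the authors' implementation: the linear `fuse` head, the use of all three fused pairs, the `fused_weight` coefficient, the temperature of 0.1, and every function name are assumptions introduced here. The released repository should be consulted for the actual method.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE loss; row i of `a` is the positive for row i of `b`."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    # Cosine similarity of every a-row with every b-row, scaled by temperature.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_sum = peak + math.log(sum(math.exp(l - peak) for l in row))
            total += log_sum - row[i]  # negative log-probability of the positive
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


def fuse(a, b, weights):
    """Hypothetical fusion head: a linear map applied to the concatenated pair."""
    fused = []
    for u, v in zip(a, b):
        pair = u + v  # list concatenation of the two embeddings
        fused.append([sum(p * w for p, w in zip(pair, row)) for row in weights])
    return fused


def confu_style_loss(z1, z2, z3, w12, w13, w23, fused_weight=1.0):
    """Pairwise terms plus fused-pair-to-third-modality terms (assumed form)."""
    pairwise = info_nce(z1, z2) + info_nce(z1, z3) + info_nce(z2, z3)
    fused = (
        info_nce(fuse(z1, z2, w12), z3)
        + info_nce(fuse(z1, z3, w13), z2)
        + info_nce(fuse(z2, z3, w23), z1)
    )
    return pairwise + fused_weight * fused


# Toy usage with random embeddings: batch of 8, embedding dimension 4.
rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


z1, z2, z3 = random_matrix(8, 4), random_matrix(8, 4), random_matrix(8, 4)
w12, w13, w23 = random_matrix(4, 8), random_matrix(4, 8), random_matrix(4, 8)
loss = confu_style_loss(z1, z2, z3, w12, w13, w23)
```

Because the individual embeddings and the fused embeddings live in the same space under this sketch, the same similarity computation would serve both one-to-one retrieval (query with `z1`, rank `z3`) and two-to-one retrieval (query with `fuse(z1, z2, w12)`, rank `z3`). A linear head is used only to keep the example short; a linear map cannot represent XOR-like interactions, so a nonlinear fusion module would be needed for those.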

representative citing papers

A Comparison of Fusion Techniques for Multi-Modal Human Activity Recognition on the HARMES Dataset

cs.LG · 2026-06-26 · conditional · novelty 6.0

Gated Multi-modal Fusion reaches 0.82 macro F1 on HARMES, beating the concatenation baseline of 0.76 by 6 points under leave-one-participant-out evaluation.

citing papers explorer

Showing 1 of 1 citing paper.

A Comparison of Fusion Techniques for Multi-Modal Human Activity Recognition on the HARMES Dataset

cs.LG · 2026-06-26 · conditional · none · ref 19 · internal anchor
