DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning

Chengxuan Qian , Shuo Xing , Shawn Li , Yue Zhao , Zhengzhong Tu

classification 💻 cs.CV

keywords multimodal, cross-modal, decalign, alignment, semantic, across, learning, modalities

Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. However, the intrinsic heterogeneity of diverse modalities makes effective cross-modal collaboration and integration substantially challenging. To address this, we introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. To handle heterogeneity, we employ a prototype-guided optimal transport alignment strategy that leverages Gaussian mixture modeling and multi-marginal transport plans, thus mitigating distribution discrepancies while preserving modality-unique characteristics. To reinforce homogeneity, we ensure semantic consistency across modalities through latent distribution matching with Maximum Mean Discrepancy regularization. Furthermore, we incorporate a multimodal transformer to enhance high-level semantic feature fusion, thereby further reducing cross-modal inconsistencies. Extensive experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods across five metrics. These results highlight the efficacy of DecAlign in improving cross-modal alignment and semantic consistency while preserving modality-unique features, marking a significant advance in multimodal representation learning. Our project page is at https://taco-group.github.io/DecAlign.
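To make the Maximum Mean Discrepancy term mentioned in the abstract more concrete, the following is a minimal sketch of a biased squared-MMD estimate with a Gaussian kernel. It is an illustration under stated assumptions, not the DecAlign implementation: the function names (`rbf_kernel`, `mmd_squared`), the fixed `bandwidth`, the choice of a single RBF kernel, and the synthetic "text" and "audio" features are all hypothetical.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


# Synthetic stand-ins for modality-common features (assumed, not from the paper).
rng = np.random.default_rng(0)
text_feats = rng.normal(0.0, 1.0, size=(200, 8))
audio_feats_close = rng.normal(0.0, 1.0, size=(200, 8))  # same distribution
audio_feats_far = rng.normal(1.5, 1.0, size=(200, 8))    # shifted mean

mmd_close = mmd_squared(text_feats, audio_feats_close, bandwidth=2.0)
mmd_far = mmd_squared(text_feats, audio_feats_far, bandwidth=2.0)
print(f"MMD^2 (matched distributions): {mmd_close:.4f}")
print(f"MMD^2 (shifted distributions): {mmd_far:.4f}")
```

Used as a regularizer, a term of this kind would be added to the training loss so that the latent distributions of different modalities are pulled together; the shifted pair above yields the larger value, which is the discrepancy such a penalty would reduce.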

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
cs.CY 2026-04 unverdicted novelty 6.0

Moral judgments become more deontological when human design of AI is visible, and designers are judged more strictly than the AI or unaided humans, creating plural and non-converging targets for value alignment.

The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
cs.CY 2026-04 conditional novelty 5.0

People judge AI systems and their human designers with markedly more deontological constraints than they apply to humans or standalone robots in the same ethical scenario.

Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities
cs.CV 2026-04 unverdicted novelty 5.0

Introduces MAF framework and DeepModal-Bench to capture universal cross-modal forgery traces for better generalization in multimodal deepfake detection.