Gated Multimodal Units for Information Fusion

Fabio A. González; John Arevalo; Manuel Montes-y-Gómez; Thamar Solorio

arXiv: 1702.01992 · v1 · pith:B66MMGQLnew · submitted 2017-02-07 · stat.ML · cs.LG

John Arevalo, Thamar Solorio, Manuel Montes-y-Gómez, Fabio A. González

keywords multimodal, gated, unit, dataset, fusion, genre, modalities, model

This paper presents a novel model for multimodal learning based on gated neural networks. The Gated Multimodal Unit (GMU) is intended to be used as an internal unit in a neural network architecture whose purpose is to find an intermediate representation based on a combination of data from different modalities. Using multiplicative gates, the GMU learns to decide how each modality influences the activation of the unit. It was evaluated in a multilabel scenario for genre classification of movies using the plot and the poster. The GMU improved on the macro F-score of single-modality approaches and outperformed other fusion strategies, including mixture-of-experts models. Along with this work, the MM-IMDb dataset is released, which, to the best of our knowledge, is the largest publicly available multimodal dataset for movie genre prediction.
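
The gating idea in the abstract can be illustrated with a minimal sketch of a two-modality unit. The abstract does not spell out the equations, so the specific form used here is an assumption: each modality gets a tanh projection, a sigmoid gate is computed from the concatenated inputs, and the output is a gate-weighted mix of the two projections. The function name gmu_forward, the helper functions, the dimensions, and the random weights are all hypothetical and chosen only for illustration. No training is shown.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(a):
    """Logistic function; fine for the small inputs used in this sketch."""
    return 1.0 / (1.0 + math.exp(-a))


def gmu_forward(x_v, x_t, W_v, W_t, W_z):
    """Assumed bimodal gated fusion (illustrative, not the authors' code).

    h_v = tanh(W_v x_v)           visual projection
    h_t = tanh(W_t x_t)           textual projection
    z   = sigmoid(W_z [x_v; x_t]) multiplicative gate, one value per unit
    h   = z * h_v + (1 - z) * h_t fused representation
    """
    h_v = [math.tanh(a) for a in matvec(W_v, x_v)]
    h_t = [math.tanh(a) for a in matvec(W_t, x_t)]
    z = [sigmoid(a) for a in matvec(W_z, x_v + x_t)]  # list concatenation
    h = [zi * hv + (1.0 - zi) * ht for zi, hv, ht in zip(z, h_v, h_t)]
    return h, z


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 4-d poster features, 3-d plot features, 2 hidden units.
rng = random.Random(0)
d_v, d_t, d_h = 4, 3, 2
W_v = random_matrix(d_h, d_v, rng)
W_t = random_matrix(d_h, d_t, rng)
W_z = random_matrix(d_h, d_v + d_t, rng)

x_v = [rng.uniform(-1.0, 1.0) for _ in range(d_v)]
x_t = [rng.uniform(-1.0, 1.0) for _ in range(d_t)]

h, z = gmu_forward(x_v, x_t, W_v, W_t, W_z)
print("gate z:", z)
print("fused h:", h)
```

In this sketch a gate value near 1 lets the visual projection dominate a hidden unit, and a value near 0 lets the textual projection dominate. This is one way to read "how modalities influence the activation of the unit".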

Forward citations

Cited by 15 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ConTact: Contact-First Antibody CDR Design via Explicit Interface Reasoning
cs.LG 2026-05 unverdicted novelty 7.0

ConTact decomposes CDR design into surface fingerprint learning, contact prediction, and contact-gated sequence generation using distance-biased attention and weighted loss, reporting 7% RMSD and 10% F1 gains on CHIME...
DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning
cs.CV 2026-03 unverdicted novelty 7.0

A new 1695-sample multicultural dataset plus two modules for stable multimodal fusion and modality consistency yield state-of-the-art deception detection with cross-cultural transfer.
AgForce Enables Antigen-conditioned Generative Antibody Design
cs.LG 2026-05 unverdicted novelty 6.0

AgForce improves antigen-conditioned antibody design by using framework dropout, gated bottlenecks, hyperbolic cross attention, MDN sequence head with Potts-like coupling, annealed MCL, and antigen cycle consistency t...
EvoStruct: Bridging Evolutionary and Structural Priors for Antibody CDR Design via Protein Language Model Adaptation
cs.LG 2026-05 unverdicted novelty 6.0

EvoStruct integrates evolutionary priors from a protein language model with structural priors from an E(3)-equivariant GNN to raise amino acid recovery by 16% and diversity by 2.3x on CHIMERA-Bench while cutting perpl...
Learning Multi-Relational Graph Representations for DNA Methylation-Based Biological Age Estimation
cs.LG 2026-05 unverdicted novelty 6.0

RelAge-GNN models relationships among CpG sites via co-methylation, genomic location, and gene association graphs to estimate biological age more accurately than prior methods.
EduGage: Methods and Dataset for Sensor-Based Momentary Assessment of Engagement in Self-Guided Video Learning
cs.HC 2026-05 unverdicted novelty 6.0

EduGage releases a multimodal sensor dataset and models for estimating learner engagement in self-guided video learning, reporting MAE of 0.81 and outperforming baselines with 16 participants.
CGCMA: Conditionally-Gated Cross-Modal Attention for Event-Conditioned Asynchronous Fusion
cs.LG 2026-04 unverdicted novelty 6.0

CGCMA separates text-conditioned grounding from lag-aware trust gating to fuse asynchronous price and web data, yielding the highest Sharpe ratio of +0.449 on a new crypto news corpus.
Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration
cs.AI 2026-02 unverdicted novelty 6.0

A diffusion model with dynamic modality gating and cross-modal mutual learning restores missing features in VLMs bi-directionally while preserving the original model's generalization.
Routing-Based Continual Learning for Multimodal Large Language Models
cs.LG 2025-11 unverdicted novelty 6.0

Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.
Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion
cs.CL 2019-06 unverdicted novelty 6.0

Gated fusion of fastText and BERT embeddings into an end-to-end ASR model captures multi-sentence conversational context and lowers word error rate on the Switchboard corpus.
Multimodal and Multi-view Models for Emotion Recognition
cs.CL 2019-06 unverdicted novelty 5.0

Multimodal training with attention and contrastive multi-view learning improves both combined and acoustic-only emotion recognition on IEMOCAP over prior acoustic baselines.
Controlling Decision Drift in Multimodal Sentiment Analysis with Missing Modalities
cs.CV 2026-05 unverdicted novelty 4.0

A two-level reference alignment framework uses complete-modality samples and prototype voting to reduce decision drift and improve robustness in multimodal sentiment analysis under missing modalities.
Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation
cs.CL 2026-05 unverdicted novelty 4.0

LiSCP detects LLM-generated text via stylistic consistency profiling across paraphrased variants and reports up to 11.79% better cross-domain accuracy plus robustness to adversarial attacks.
Dementia classification from spontaneous speech using wrapper-based feature selection
eess.AS 2025-02 unverdicted novelty 4.0

Using full-recording acoustic features and wrapper selection, the Extreme Minimal Learning Machine provides competitive dementia classification accuracy at lower computational cost on ADReSS and Pitt datasets.
A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition
cs.CV 2026-03 conditional novelty 3.0

A dual-modality model combining DINOv2 visual features with Wav2Vec audio features achieves Macro-F1 of 0.5368 on the ABAW validation set for facial expression recognition.