Dynamic multi-target fusion for efficient audio-visual navigation
6 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (6)
verdicts: UNVERDICTED (6)
citing papers explorer
- Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction
  BDATP enhances generalization in audio-visual navigation by explicitly modeling interaural differences and using auxiliary action prediction, achieving gains of up to 21.6 percentage points in success rate for unheard sounds on the Replica dataset.
- EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence
  EAD-Net uses a diffusion model with new spatio-temporal attention, graph-based temporal reasoning, and LLM-derived semantic descriptions to generate emotionally expressive talking head videos with improved lip-sync and coherence over prior methods.
- Reliability-Aware Geometric Fusion for Robust Audio-Visual Navigation
  RAVN improves audio-visual navigation by learning audio-derived reliability cues via an Acoustic Geometry Reasoner and using them to modulate visual features through Reliability-Aware Geometric Modulation.
- Spatial-Aware Conditioned Fusion for Audio-Visual Navigation
  SACF discretizes target direction and distance from audio-visual cues, then applies conditioned fusion to improve navigation efficiency and generalization to unheard sounds.
- Audio Spatially-Guided Fusion for Audio-Visual Navigation
  Audio Spatially-Guided Fusion improves generalization in audio-visual navigation on unheard sound sources by extracting spatial audio features and adaptively fusing them with visual data.
- ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations
  ML-SAN uses input calibration with FiLM, interaction gating, and output regularization to adapt emotion recognition to individual speaker styles, reporting gains on MELD and IEMOCAP, especially for rare sentiment classes.