AVID is the first large-scale benchmark for audio-visual inconsistency detection, grounding, classification, and reasoning in long videos, constructed via agent-driven methods and showing that state-of-the-art models struggle while a fine-tuned baseline improves performance.
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts: UNVERDICTED 8
roles: dataset 1
polarities: background 1
citing papers explorer
- AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction
  AVID is the first large-scale benchmark for audio-visual inconsistency detection, grounding, classification, and reasoning in long videos, constructed via agent-driven methods and showing that state-of-the-art models struggle while a fine-tuned baseline improves performance.
- RoleMAG: Learning Neighbor Roles in Multimodal Graphs
  RoleMAG learns neighbor roles in multimodal graphs to route shared, complementary, and heterophilous signals through separate channels, improving propagation without modality interference.
- Unmasking Puppeteers: Leveraging Biometric Leakage to Expose Impersonation in AI-Based Videoconferencing
  A pose-conditioned large-margin contrastive encoder isolates persistent biometric identity cues from transmitted latents in talking-head videoconferencing to flag impersonation attacks via cosine similarity without inspecting the output video.
- CoGate-LSTM: Prototype-Guided Feature-Space Gating for Mitigating Gradient Dilution in Imbalanced Toxic Comment Classification
  CoGate-LSTM adds prototype-guided cosine feature-space gating to a character-level BiLSTM with multi-source embeddings and focal loss, reaching 0.881 macro-F1 on Jigsaw toxic comments while using 7.3M parameters and outperforming fine-tuned BERT by 6.9 points on minority labels.
- Towards Open World Sound Event Detection
  Introduces OW-SED paradigm and WOOT framework with deformable attention for detecting known and unseen sound events in open-world settings.
- SafeScreen: A Safety-First Screening Framework for Personalized Video Retrieval for Vulnerable Users
  SafeScreen enforces individualized safety constraints as a prerequisite for video retrieval by using profile extraction, adaptive VideoRAG analysis, and LLM decision-making to approve content for vulnerable users.
- Structural and Disentangled Adaptation of Large Vision Language Models for Multimodal Recommendation
  SDA uses structural alignment as a soft teacher and gated low-rank expert paths to adapt LVLMs for multimodal recommendation, reporting 6.15% Hit@10 and 8.64% NDCG@10 average gains plus larger long-tail improvements on Amazon datasets.
- Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding
  A literature survey on abstract concept recognition in videos that catalogs prior tasks and datasets while advocating for foundation models and reuse of decades of community experience.