AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue
4 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (4)
verdicts: UNVERDICTED (4)
roles: background (2)
polarities: background (2)
citing papers explorer
- EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing
  The EAR method improves uni-modal event perception in audio-visual video parsing via targeted enhancements to pseudo-label generation and feature modeling, outperforming prior state-of-the-art approaches.
- AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering
  AV-Master introduces dynamic adaptive focus sampling, modality preference modeling, and dual-path contrastive loss to outperform prior methods on audio-visual question answering benchmarks.
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
  VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
  VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.