AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue

2024 · arXiv 2403.16276

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

cs.CV · 2026-05-09 · unverdicted · novelty 5.0

The EAR method improves uni-modal event perception in audio-visual video parsing via targeted enhancements to pseudo-label generation and feature modeling, outperforming prior state-of-the-art approaches.

AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering

cs.CV · 2025-10-21 · unverdicted · novelty 5.0

AV-Master introduces dynamic adaptive focus sampling, modality preference modeling, and dual-path contrastive loss to outperform prior methods on audio-visual question answering benchmarks.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

cs.CV · 2025-01-22 · unverdicted · novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

cs.CV · 2024-06-11 · unverdicted · novelty 4.0

VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

citing papers explorer

EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

cs.CV · 2026-05-09 · unverdicted · none · ref 48

AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering

cs.CV · 2025-10-21 · unverdicted · none · ref 57

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

cs.CV · 2025-01-22 · unverdicted · none · ref 163

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

cs.CV · 2024-06-11 · unverdicted · none · ref 45
