A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.
Video-LLaMA: An instruction-tuned audio-visual language model for video understanding
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2024 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.