Video-LLaMA: An instruction-tuned audio-visual language model for video understanding

Hang Zhang, Xin Li, Lidong Bing · 2023

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Small Vision-Language Models are Smart Compressors for Long Video Understanding

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Tempo uses a 6B small vision-language model (SVLM) as a local temporal compressor with training-free adaptive token allocation. It reaches state-of-the-art long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on the 4101 s LVBench under an 8K-token budget.

citing papers explorer


Small Vision-Language Models are Smart Compressors for Long Video Understanding

cs.CV · 2026-04-09 · unverdicted · none · ref 16
