Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan · 2023

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

browse 3 citing papers

citation-role summary

baseline 2

citation-polarity summary

baseline 2

representative citing papers

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

cs.CL · 2024-10-14 · conditional · novelty 7.0

DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

cs.CV · 2024-07-10 · unverdicted · novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.

Long Context Transfer from Language to Vision

cs.CV · 2024-06-24 · unverdicted · novelty 6.0

Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.

citing papers explorer

Showing 3 of 3 citing papers.

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cs.CL · 2024-10-14 · conditional · none · ref 29
DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models cs.CV · 2024-07-10 · unverdicted · none · ref 32
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
Long Context Transfer from Language to Vision cs.CV · 2024-06-24 · unverdicted · none · ref 43
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.

Video-llava: Learning united visual representation by alignment before projection

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer