MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
LLaVA-NeXT: Stronger LLMs supercharge multimodal capabilities in the wild, May 2024
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
baseline 1
citation-polarity summary
fields
cs.CV 3
roles
baseline 1
polarities
baseline 1
representative citing papers
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.
citing papers explorer
- Long Context Transfer from Language to Vision
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.