MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning, 2023
4 Pith papers cite this work. Polarity classification is still being indexed.
citation summary
fields: cs.CV (4)
roles: background (1)
polarities: background (1)
representative citing papers
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
SkyNative introduces an encoder-free architecture using raw patch tokens and modality-specific parameters in a unified autoregressive model to improve image-grounded reasoning in remote sensing vision-language tasks.
OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.
citing papers explorer
- Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain
  Introduces VIG metric to measure visual contribution via perplexity reduction and applies it for selective training of LVLMs on high-VIG samples and tokens to improve grounding with reduced supervision.
- MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
  MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
- SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning
  SkyNative introduces an encoder-free architecture using raw patch tokens and modality-specific parameters in a unified autoregressive model to improve image-grounded reasoning in remote sensing vision-language tasks.
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
  OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.