Learn to explain: Multimodal reasoning via thought chains for science question answering
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 2
citing papers explorer
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
- DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
  DualToken disentangles semantics and appearance via separate codebooks in one tokenizer, reporting 0.25 rFID, 82% ImageNet zero-shot accuracy, and gains over VILA-U on understanding and generation benchmarks.
- PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
  PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
  A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
- Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models
  TwigVLM adds a twig module to VLMs for twig-guided token pruning and self-speculative decoding, retaining 96% performance after pruning 88.9% visual tokens and delivering 154% speedup on long responses for LLaVA-1.5-7B.