Honeybee: Locality-enhanced Projector for Multimodal LLM

Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh · 2023 · arXiv 2312.06742

5 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

cs.CV · 2024-03-14 · unverdicted · novelty 6.0

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

cs.CV · 2024-01-29 · conditional · novelty 6.0

MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

cs.LG · 2024-02-18 · unverdicted · novelty 5.0

POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

PaliGemma: A versatile 3B VLM for transfer

cs.CV · 2024-07-10 · unverdicted · novelty 4.0

PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

cs.CV · 2024-02-06 · unverdicted · novelty 4.0

MobileVLM V2 shows that 1.7B and 3B parameter vision-language models can reach or exceed the performance of 3B and 7B+ models on common VLM benchmarks via targeted design and data improvements.

citing papers explorer

Showing 5 of 5 citing papers.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training · cs.CV · 2024-03-14 · unverdicted · none · ref 12
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models · cs.CV · 2024-01-29 · conditional · none · ref 4
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning · cs.LG · 2024-02-18 · unverdicted · none · ref 145
PaliGemma: A versatile 3B VLM for transfer · cs.CV · 2024-07-10 · unverdicted · none · ref 20
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model · cs.CV · 2024-02-06 · unverdicted · none · ref 7
