arxiv: 2307.06930 · v3 · pith:TDBYX5XCnew · submitted 2023-07-13 · 💻 cs.CV · cs.CL

mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs

keywords: multilingual, models, data, english, image, llms, mblip, vision-llms

Modular vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to "understand" the image input. With the abundance of readily available high-quality English image-text data as well as strong monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware. To this end, we re-align an image encoder previously tuned to an English LLM to a new, multilingual LLM using only a few million multilingual training examples derived from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark and XM3600, mBLIP yields results competitive with state-of-the-art models and it greatly outperforms strong English-only Vision-LLMs like Llava 1.5. We release our model, code, and train data at https://github.com/gregor-ge/mBLIP.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NRITYAM: Language Models Meet Art and Heritage of Dance

    cs.CL 2026-06 unverdicted novelty 6.0

    NRITYAM creates the largest multilingual benchmark for evaluating language models' understanding of dance traditions through expert-curated QA pairs.