GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.
mplug-paperowl: Scientific diagram analysis with the multimodal large language model,
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
citing papers explorer
-
GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.
-
DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.
-
A Survey on Multimodal Large Language Models
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.