TextCaps: a dataset for image captioning with reading comprehension, 2020
2 Pith papers cite this work. Polarity classification is still indexing.
years
2024: 2
representative citing papers
LogicVista is a new benchmark dataset with 448 visual logic questions that evaluates multimodal LLMs on five reasoning tasks covering nine capabilities.
citing papers explorer
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
  VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
- LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
  LogicVista is a new benchmark dataset with 448 visual logic questions that evaluates multimodal LLMs on five reasoning tasks covering nine capabilities.