hub

Generative multimodal models are in-context learners.CoRR, abs/2312.13286, 2023a

Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, et al · 2023 · arXiv 2312.13286

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 baseline 1

citation-polarity summary

background 3 baseline 1

representative citing papers

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

cs.CL · 2023-11-27 · unverdicted · novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

cs.CV · 2024-06-13 · conditional · novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

MMaDA: Multimodal Large Diffusion Language Models

cs.CV · 2025-05-21 · unverdicted · novelty 6.0

MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-image tasks.

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

cs.CV · 2024-04-22 · unverdicted · novelty 6.0

SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

cs.CV · 2024-03-14 · unverdicted · novelty 6.0

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

CogVLM: Visual Expert for Pretrained Language Models

cs.CV · 2023-11-06 · conditional · novelty 6.0

CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.

Grounding Everything in Tokens for Multimodal Large Language Models

cs.CV · 2025-12-11 · unverdicted · novelty 5.0

GETok partitions images with grid tokens and refines locations via offset tokens to enable better native 2D spatial reasoning in MLLMs.

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

cs.CV · 2024-08-22 · unverdicted · novelty 5.0

Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

cs.CV · 2024-03-27 · unverdicted · novelty 5.0

Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.

PaliGemma: A versatile 3B VLM for transfer

cs.CV · 2024-07-10 · unverdicted · novelty 4.0

PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

citing papers explorer

Showing 10 of 10 citing papers.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI cs.CL · 2023-11-27 · unverdicted · none · ref 70
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding cs.CV · 2024-06-13 · conditional · none · ref 56
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
MMaDA: Multimodal Large Diffusion Language Models cs.CV · 2025-05-21 · unverdicted · none · ref 7
MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-image tasks.
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation cs.CV · 2024-04-22 · unverdicted · none · ref 25
SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training cs.CV · 2024-03-14 · unverdicted · none · ref 105
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
CogVLM: Visual Expert for Pretrained Language Models cs.CV · 2023-11-06 · conditional · none · ref 21
CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
Grounding Everything in Tokens for Multimodal Large Language Models cs.CV · 2025-12-11 · unverdicted · none · ref 59
GETok partitions images with grid tokens and refines locations via offset tokens to enable better native 2D spatial reasoning in MLLMs.
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation cs.CV · 2024-08-22 · unverdicted · none · ref 17
Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models cs.CV · 2024-03-27 · unverdicted · none · ref 47
Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.
PaliGemma: A versatile 3B VLM for transfer cs.CV · 2024-07-10 · unverdicted · none · ref 123
PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

Generative multimodal models are in-context learners.CoRR, abs/2312.13286, 2023a

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer