Geneval: An object-focused framework for evaluating text-to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, Ludwig Schmidt · 2023

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

browse 4 citing papers

citation-role summary

dataset 2 baseline 1

citation-polarity summary

use dataset 2 baseline 1

representative citing papers

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

OmniGen2: Towards Instruction-Aligned Multimodal Generation

cs.CV · 2025-06-23 · unverdicted · novelty 5.0

OmniGen2 introduces a unified generative model with two distinct decoding pathways and a decoupled image tokenizer that achieves competitive results on text-to-image and editing benchmarks plus state-of-the-art consistency among open-source models on the new OmniContext benchmark.

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

cs.CV · 2025-06-03 · unverdicted · novelty 5.0

UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

cs.CV · 2025-05-14 · conditional · novelty 5.0

BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.

citing papers explorer

Showing 4 of 4 citing papers.

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture cs.CV · 2026-05-12 · unverdicted · none · ref 43
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
OmniGen2: Towards Instruction-Aligned Multimodal Generation cs.CV · 2025-06-23 · unverdicted · none · ref 20
OmniGen2 introduces a unified generative model with two distinct decoding pathways and a decoupled image tokenizer that achieves competitive results on text-to-image and editing benchmarks plus state-of-the-art consistency among open-source models on the new OmniContext benchmark.
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation cs.CV · 2025-06-03 · unverdicted · none · ref 14
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset cs.CV · 2025-05-14 · conditional · none · ref 8
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.

Geneval: An object-focused framework for evaluating text-to-image alignment

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer