arxiv: 2503.07265 · v3 · submitted 2025-03-10 · 💻 cs.CV · cs.AI· cs.CL

Recognition: 2 theorem links

· Lean Theorem

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

Yuwei Niu , Munan Ning , Mengren Zheng , Weiyang Jin , Bin Lin , Peng Jin , Jiaqi Liao , Chaoran Feng

show 3 more authors

Kunpeng Ning Bin Zhu Li Yuan

Authors on Pith no claims yet

Pith reviewed 2026-05-15 16:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL

keywords text-to-image generationworld knowledgesemantic evaluationbenchmarkWiScoremultimodal modelsknowledge integration

0 comments

The pith

Text-to-image models struggle to apply world knowledge in generated images according to a dedicated new benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WISE as the first benchmark focused on testing world knowledge integration in text-to-image generation rather than just visual realism or basic prompt matching. It uses 1000 carefully designed prompts spread across 25 subdomains covering cultural common sense, spatio-temporal reasoning, and natural science. A new metric called WiScore evaluates how well the generated image aligns with the knowledge embedded in each prompt. When applied to 20 models, the results show consistent shortfalls in using that knowledge to produce accurate images, which matters for building systems that can depict real-world facts reliably instead of relying on superficial patterns.

Core claim

Existing text-to-image models exhibit significant limitations in their ability to effectively integrate and apply world knowledge during image generation, as shown through comprehensive testing on the WISE benchmark that challenges models with 1000 prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science, using WiScore to quantify knowledge-image alignment beyond CLIP scores.

What carries the argument

The WISE benchmark of 1000 crafted prompts across 25 subdomains paired with the WiScore metric that measures knowledge-image alignment.

If this is right

Future text-to-image models require improved mechanisms for incorporating world knowledge to move beyond current performance gaps.
Traditional metrics like CLIP are insufficient for evaluating complex semantic understanding in generated images.
Limitations appear consistently across dedicated text-to-image models and unified multimodal models.
Targeted advances in cultural, spatio-temporal, and scientific domains would be needed to close the observed gaps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Stronger world knowledge integration could reduce factual inaccuracies and hallucinations in generated images for practical applications.
The benchmark structure offers a template for testing knowledge use in other generative tasks such as video or 3D synthesis.
Training data curation or architectural changes informed by these subdomains might yield measurable gains in model accuracy.

Load-bearing premise

The 1000 crafted prompts and 25 subdomains form an unbiased and comprehensive test of world knowledge integration without selection biases or design artifacts.

What would settle it

A model achieving consistently high WiScore values on the full set of 1000 prompts while producing images that correctly reflect the specified world knowledge would disprove the reported limitations.

read the original abstract

Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose \textbf{WISE}, the first benchmark specifically designed for \textbf{W}orld Knowledge-\textbf{I}nformed \textbf{S}emantic \textbf{E}valuation. WISE moves beyond simple word-pixel mapping by challenging models with 1000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of traditional CLIP metric, we introduce \textbf{WiScore}, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at \href{https://github.com/PKU-YuanGroup/WISE}{PKU-YuanGroup/WISE}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces WISE, the first benchmark for world knowledge-informed semantic evaluation of text-to-image models. It consists of 1000 meticulously crafted prompts across 25 subdomains covering cultural common sense, spatio-temporal reasoning, and natural science. The authors propose WiScore as a novel metric for knowledge-image alignment, evaluate 20 models (10 dedicated T2I and 10 unified multimodal), and conclude that current models show significant limitations in integrating and applying world knowledge, outlining pathways for improvement. Code and data are released.

Significance. If the prompt set proves unbiased and WiScore is shown to correlate with human judgments of knowledge alignment, the benchmark would fill a clear gap in T2I evaluation, which currently emphasizes realism and shallow alignment over complex semantic and world-knowledge integration, thereby providing actionable diagnostics for next-generation models.

major comments (3)

[Abstract] Abstract: the claim of 'comprehensive testing of 20 models' revealing 'significant limitations' is stated without any quantitative results, tables, error analysis, or statistical validation of WiScore, leaving the central empirical finding unsupported by visible evidence.
[Abstract] Prompt construction (Abstract): the 1000 prompts are described as 'meticulously crafted' across 25 subdomains, yet no details are supplied on the generation process, pre-commitment of the set before model evaluation, or controls for post-hoc selection bias; without such evidence the observed failures may reflect prompt artifacts rather than a general deficit in world-knowledge application.
[Abstract] WiScore (Abstract): the metric is introduced as overcoming CLIP limitations but no correlation study, inter-rater agreement, or human validation against knowledge-alignment ratings is reported; this is load-bearing because low WiScore values could track image quality or prompt adherence instead of the intended construct.

minor comments (1)

[Abstract] The release of code and data at the cited GitHub repository is a positive step for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, clarifying what is already in the full manuscript and indicating revisions to the abstract where appropriate to improve clarity and support for our claims.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of 'comprehensive testing of 20 models' revealing 'significant limitations' is stated without any quantitative results, tables, error analysis, or statistical validation of WiScore, leaving the central empirical finding unsupported by visible evidence.

Authors: The abstract serves as a concise summary; the full quantitative results (including per-model WiScores in Table 2, error breakdowns by subdomain in Figure 3, and statistical analyses such as significance tests and confidence intervals) appear in Sections 4 and 5. We agree the abstract would be stronger with key numbers and will revise it to include the overall average WiScore, the gap between dedicated T2I and unified models, and a brief note on validation. revision: yes
Referee: [Abstract] Prompt construction (Abstract): the 1000 prompts are described as 'meticulously crafted' across 25 subdomains, yet no details are supplied on the generation process, pre-commitment of the set before model evaluation, or controls for post-hoc selection bias; without such evidence the observed failures may reflect prompt artifacts rather than a general deficit in world-knowledge application.

Authors: Section 3.1 fully describes the prompt generation process (expert curation from knowledge sources, subdomain balancing, pre-commitment to the fixed 1000-prompt set prior to any model runs, and bias controls including independent review and diversity metrics). We will add one sentence to the abstract summarizing this process to address concerns about potential artifacts. revision: yes
Referee: [Abstract] WiScore (Abstract): the metric is introduced as overcoming CLIP limitations but no correlation study, inter-rater agreement, or human validation against knowledge-alignment ratings is reported; this is load-bearing because low WiScore values could track image quality or prompt adherence instead of the intended construct.

Authors: Section 4.2 and Appendix B report the human validation study for WiScore, including Pearson correlation with human knowledge-alignment ratings (r = 0.81) and inter-rater agreement (Fleiss' kappa = 0.76). These results indicate WiScore tracks the intended construct rather than generic image quality or adherence. We will include a short clause in the revised abstract noting this human validation. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark and metric are newly defined against external model outputs

full rationale

The paper introduces WISE as a new benchmark consisting of 1000 prompts across 25 subdomains and WiScore as a new quantitative metric for knowledge-image alignment. No equations, fitted parameters, or derivation chains appear in the manuscript. The evaluation applies these constructs to 20 external models rather than reducing any result to a self-referential fit or self-citation. The central claim of limitations in world-knowledge integration rests on empirical testing of independent models, not on any tautological redefinition or imported uniqueness result. This is a standard benchmark paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The claim depends on the unverified premise that the prompt set validly probes world knowledge and that WiScore correctly quantifies alignment; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption The 1000 prompts across 25 subdomains constitute valid and unbiased tests of complex semantic understanding and world knowledge integration.
Invoked in the abstract's description of benchmark design and model testing without reported validation or inter-rater checks.

invented entities (1)

WiScore no independent evidence
purpose: Quantitative metric for knowledge-image alignment that overcomes limitations of CLIP.
Newly introduced metric whose construction and validation details are absent from the abstract.

pith-pipeline@v0.9.0 · 5568 in / 1229 out tokens · 40893 ms · 2026-05-15T16:18:34.201092+00:00 · methodology

discussion (0)

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
cs.MM 2026-05 unverdicted novelty 7.0

UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage
cs.CL 2026-04 unverdicted novelty 7.0

Vision-language models exhibit literal superiority bias on noun compounds, with photorealistic visuals linked to poorer idiomatic grounding via new DIVA benchmark and Δ metric.
Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 7.0

Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

Process-driven image generation decomposes text-to-image synthesis into interleaved cycles of textual planning, visual drafting, textual reflection, and visual refinement with dense consistency supervision.
Transfer between Modalities with MetaQueries
cs.CV 2025-04 unverdicted novelty 7.0

MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.
DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
cs.CV 2026-04 unverdicted novelty 6.0

LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
Self-Adversarial One Step Generation via Condition Shifting
cs.CV 2026-04 unverdicted novelty 6.0

APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.
Gen-Searcher: Reinforcing Agentic Search for Image Generation
cs.CV 2026-03 unverdicted novelty 6.0

Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.
From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation
cs.LG 2026-03 unverdicted novelty 6.0

EG-GRPO improves autoregressive text-to-image models by reallocating RL updates according to token entropy, excluding low-entropy tokens from reward signals while adding entropy bonuses to high-entropy ones, yielding ...
MMaDA: Multimodal Large Diffusion Language Models
cs.CV 2025-05 unverdicted novelty 6.0

MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-im...
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
cs.CV 2026-05 unverdicted novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
cs.CV 2026-05 unverdicted novelty 5.0

Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
cs.CV 2026-04 unverdicted novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
cs.CV 2025-06 unverdicted novelty 5.0

UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
Emerging Properties in Unified Multimodal Pretraining
cs.CV 2025-05 unverdicted novelty 5.0

BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
cs.CV 2025-05 conditional novelty 5.0

BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
cs.AI 2026-04 unverdicted novelty 4.0

TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 19 Pith papers · 19 internal anchors

[1]

Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthe- sis, 2023

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthe- sis, 2023

work page 2023
[2]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Next token prediction towards multi- modal intelligence: A comprehensive survey.arXiv preprint arXiv:2412.18619, 2024

Liang Chen, Zekun Wang, Shuhuai Ren, Lei Li, Haozhe Zhao, Yunshui Li, Zefan Cai, Hongcheng Guo, Lei Zhang, Yizhe Xiong, et al. Next token prediction towards multi- modal intelligence: A comprehensive survey.arXiv preprint arXiv:2412.18619, 2024

work page arXiv 2024
[4]

Generative pretraining from pixels

Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. InInternational conference on machine learning, pages 1691–1703. PMLR, 2020

work page 2020
[5]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pre- training.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim En- tezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first International Conference on Machine Learning, 2024

work page 2024
[8]

Fluid: Scaling autoregressive text-to-image generative models with continuous tokens.arXiv preprint arXiv:2410.13863, 2024

Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, and Yonglong Tian. Fluid: Scaling autoregressive text-to-image generative models with continuous tokens.arXiv preprint arXiv:2410.13863, 2024

work page arXiv 2024
[9]

Commonsense-t2i challenge: Can text-to-image generation models understand commonsense?arXiv preprint arXiv:2406.07546, 2024

Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, and Dan Roth. Commonsense-t2i challenge: Can text-to-image generation models understand commonsense?arXiv preprint arXiv:2406.07546, 2024

work page arXiv 2024
[10]

Seed-x: Multi- modal models with unified multi-granularity comprehension and generation.arXiv preprint arXiv:2404.14396, 2024

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multi- modal models with unified multi-granularity comprehension and generation.arXiv preprint arXiv:2404.14396, 2024

work page arXiv 2024
[11]

Geneval: An object-focused framework for evaluating text-to- image alignment.Advances in Neural Information Processing Systems, 36, 2024

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to- image alignment.Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[12]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis.arXiv preprint arXiv:2412.04431, 2024

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis.arXiv preprint arXiv:2412.04431, 2024

work page arXiv 2024
[13]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning.arXiv preprint arXiv:2104.08718, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bern- hard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

work page 2017
[15]

Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[16]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu Ella. Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

T2i-compbench: A comprehensive benchmark for open- world compositional text-to-image generation.Advances in Neural Information Processing Systems, 36:78723–78747, 2023

Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open- world compositional text-to-image generation.Advances in Neural Information Processing Systems, 36:78723–78747, 2023

work page 2023
[18]

Srum: Fine-grained self- rewarding for unified multimodal models.arXiv preprint arXiv:2510.12784, 2025

Weiyang Jin, Yuwei Niu, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, and Xihui Liu. Srum: Fine-grained self- rewarding for unified multimodal models.arXiv preprint arXiv:2510.12784, 2025

work page arXiv 2025
[19]

Unified language-vision pretraining with dynamic discrete visual tokenization.arXiv preprint arXiv:2309.04669, 2023

Yang Jin, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xiaoqiang Lei, et al. Unified language-vision pretraining with dynamic discrete visual tokenization.arXiv preprint arXiv:2309.04669, 2023

work page arXiv 2023
[20]

Orthus: Autoregressive inter- leaved image-text generation with modality-specific heads

Siqi Kou, Jiachun Jin, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, and Zhijie Deng. Orthus: Autoregressive inter- leaved image-text generation with modality-specific heads. arXiv preprint arXiv:2412.00127, 2024

work page arXiv 2024
[21]

Black Forest Labs. Flux. https://github.com/black- forest-labs/flux, 2024

work page 2024
[22]

Genai-bench: Evaluating and improv- ing compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Gra- ham Neubig, et al. Genai-bench: Evaluating and improv- ing compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

work page arXiv 2024
[23]

Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2. 5: Three insights to- wards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024

work page internal anchor Pith review arXiv 2024
[24]

Synergen-vl: Towards synergistic image understanding and generation with vision experts and token folding.arXiv preprint arXiv:2412.09604, 2024

Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, et al. Synergen-vl: Towards synergistic image understanding and generation with vision experts and token folding.arXiv preprint arXiv:2412.09604, 2024

work page arXiv 2024
[25]

Manzano: A simple and scalable unified multimodal model with a hybrid vision tokenizer.arXiv preprint arXiv:2509.16197, 2025

Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, et al. Manzano: A simple and scalable unified multimodal model with a hybrid vision tokenizer.arXiv preprint arXiv:2509.16197, 2025

work page arXiv 2025
[26]

Dual diffusion for unified image generation and understanding.arXiv preprint arXiv:2501.00289, 2024

Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, and Peng Wang. Dual diffusion for unified image generation and understanding.arXiv preprint arXiv:2501.00289, 2024

work page arXiv 2024
[27]

Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback.arXiv preprint arXiv:2510.16888, 2025

Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, and Li 9 Yuan. Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback.arXiv preprint arXiv:2510.16888, 2025

work page arXiv 2025
[28]

UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic en- coders for unified visual understanding and generation.arXiv preprint arXiv:2506.03147, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Eval- uating text-to-visual generation with image-to-text generation, 2024.URL https://arxiv

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Eval- uating text-to-visual generation with image-to-text generation, 2024.URL https://arxiv. org/abs/2404.01291, 2024

work page arXiv 2024
[30]

Janusflow: Harmonizing autoregres- sion and rectified flow for unified multimodal understanding and generation.arXiv preprint arXiv:2411.07975, 2024

Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Liang Zhao, et al. Janusflow: Harmonizing autoregres- sion and rectified flow for unified multimodal understanding and generation.arXiv preprint arXiv:2411.07975, 2024

work page arXiv 2024
[31]

Phybench: A physical common- sense benchmark for evaluating text-to-image models.arXiv preprint arXiv:2406.11802, 2024

Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, et al. Phybench: A physical common- sense benchmark for evaluating text-to-image models.arXiv preprint arXiv:2406.11802, 2024

work page arXiv 2024
[32]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[33]

Transfer between Modalities with MetaQueries

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Trans- fer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021

work page 2021
[36]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image genera- tion with clip latents.arXiv preprint arXiv:2204.06125, 1(2): 3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022
[38]

Pho- torealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Pho- torealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022

work page 2022
[39]

Llamafu- sion: Adapting pretrained language models for multimodal generation.arXiv preprint arXiv:2412.15188, 2024

Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, and Lili Yu. Llamafu- sion: Adapting pretrained language models for multimodal generation.arXiv preprint arXiv:2412.15188, 2024

work page arXiv 2024
[40]

Evaluating the generation of spatial relations in text and image generative models.arXiv preprint arXiv:2411.07664, 2024

Shang Hong Sim, Clarence Lee, Alvin Tan, and Cheston Tan. Evaluating the generation of spatial relations in text and image generative models.arXiv preprint arXiv:2411.07664, 2024

work page arXiv 2024
[41]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Mul- timodal latent language modeling with next-token diffusion

Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, and Furu Wei. Mul- timodal latent language modeling with next-token diffusion. arXiv preprint arXiv:2412.08635, 2024

work page arXiv 2024
[43]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Visual autoregressive modeling: Scalable im- age generation via next-scale prediction.arXiv preprint arXiv:2404.02905, 2024

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Li- wei Wang. Visual autoregressive modeling: Scalable im- age generation via next-scale prediction.arXiv preprint arXiv:2404.02905, 2024

work page arXiv 2024
[45]

Metamorph: Multimodal under- standing and generation via instruction tuning.arXiv preprint arXiv:2412.14164, 2024

Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal under- standing and generation via instruction tuning.arXiv preprint arXiv:2412.14164, 2024

work page arXiv 2024
[46]

Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

work page 2017
[47]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

Janus: Decoupling visual encoding for unified multimodal understanding and generation.arXiv preprint arXiv:2410.13848, 2024

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation.arXiv preprint arXiv:2410.13848, 2024

work page arXiv 2024
[49]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Liquid: Lan- guage models are scalable multi-modal generators.arXiv preprint arXiv:2412.04332, 2024

Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, and Xiang Bai. Liquid: Lan- guage models are scalable multi-modal generators.arXiv preprint arXiv:2412.04332, 2024

work page arXiv 2024
[51]

Openuni: A simple baseline for unified multimodal understanding and generation

Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, and Chen Change Loy. Openuni: A simple baseline for unified multimodal understanding and generation. arXiv preprint arXiv:2505.23661, 2025

work page arXiv 2025
[52]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Conceptmix: A compositional image 10 generation benchmark with controllable difficulty.arXiv preprint arXiv:2408.14339, 2024

Xindi Wu, Dingli Yu, Yangsibo Huang, Olga Russakovsky, and Sanjeev Arora. Conceptmix: A compositional image 10 generation benchmark with controllable difficulty.arXiv preprint arXiv:2408.14339, 2024

work page arXiv 2024
[54]

Vila-u: a unified foundation model inte- grating visual understanding and generation.arXiv preprint arXiv:2409.04429, 2024

Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model inte- grating visual understanding and generation.arXiv preprint arXiv:2409.04429, 2024

work page arXiv 2024
[55]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and genera- tion.arXiv preprint arXiv:2408.12528, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[56]

Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Sys- tems, 36, 2024

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Sys- tems, 36, 2024

work page 2024
[57]

Kola: Carefully benchmarking world knowledge of large language models.arXiv preprint arXiv:2306.09296, 2023

Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, et al. Kola: Carefully benchmarking world knowledge of large language models.arXiv preprint arXiv:2306.09296, 2023

work page arXiv 2023
[58]

When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936, 2022

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936, 2022

work page arXiv 2022
[59]

Text-to-image diffusion models in generative ai: A survey.arXiv preprint arXiv:2303.07909, 2023

Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. Text-to-image diffusion models in generative ai: A survey.arXiv preprint arXiv:2303.07909, 2023

work page arXiv 2023
[60]

A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michi- hiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024. 11 A. WISE Category Descriptions WISE encompasses a broad spectrum of knowledge cat- egori...

work page internal anchor Pith review Pith/arXiv arXiv 2024