InterleaveThinker: Reinforcing Agentic Interleaved Generation

Dian Zheng; Harry Lee; Hongsheng Li; Kaituo Feng; Manyuan Zhang; Ray Zhang; Zoey Guo

arxiv: 2606.13679 · v1 · pith:KFJKHTMYnew · submitted 2026-06-11 · 💻 cs.CV

Pith reviewed 2026-06-27 06:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords: interleaved generation, multi-agent pipeline, image generation, reinforcement learning, planner agent, critic agent, GRPO

The pith

A planner-critic pipeline adds interleaved text-image generation to any existing image generator.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InterleaveThinker as a multi-agent system that gives any image generator the ability to output sequences mixing text and images. A planner agent sets up the required sequence steps while a critic agent checks each output against the plan and issues corrected instructions for regeneration. The pipeline first uses supervised fine-tuning on two constructed datasets to learn formats, then applies reinforcement learning on a third dataset with accuracy and step-wise rewards so that single-step updates can steer trajectories of 25 or more generator calls. This addresses the limitation that current image generators and unified models cannot reliably produce the text-image interleaving needed for narratives, guidance, and manipulation tasks. If the approach holds, open generators can reach performance levels previously limited to closed advanced models on these benchmarks.

Core claim

InterleaveThinker is the first multi-agent pipeline that endows any image generator with interleaved generation capabilities by employing a planner agent to organize image-text input sequences and a critic agent to evaluate generator outputs, identify deviations, and refine instructions for regeneration. The system builds Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k for format cold-start, then uses Interleave-Critic-RL-13k with GRPO to reinforce step-wise instruction correction. Because full-trajectory optimization over 25+ generator calls is impractical, accuracy reward and step-wise reward are proposed so that single-step RL can guide the entire trajectory, yielding performanc

What carries the argument

The planner-critic multi-agent pipeline trained first with SFT then reinforced by single-step RL using accuracy and step-wise rewards.
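
The single-step RL stage relies on GRPO, which scores each sampled response relative to the other responses drawn for the same input. The Python sketch below shows that group-relative normalization, assuming a plain weighted sum of the accuracy and step-wise rewards. `combined_reward`, its weights, and the sample values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def combined_reward(accuracy: float, stepwise: float,
                    w_acc: float = 1.0, w_step: float = 1.0) -> float:
    """Hypothetical weighted sum of the two rewards (weights are assumptions)."""
    return w_acc * accuracy + w_step * stepwise


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and spread of its own group.

    Population standard deviation is used here; implementations differ on
    this detail. eps avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled critic responses for the same generation step, each scored
# as (accuracy reward, step-wise reward). Values are made up for illustration.
samples = [(1.0, 0.5), (0.0, 0.0), (1.0, 1.0), (0.0, 0.5)]
rewards = [combined_reward(a, s) for a, s in samples]
advantages = group_relative_advantages(rewards)

for r, adv in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={adv:+.3f}")
```

Responses scoring above the group mean receive a positive advantage and are reinforced, while those below it are pushed down. No learned value function is needed, which is part of why the method suits a single critic step sampled several times.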

If this is right

Performance improves across various image generators on interleaved tasks.
Results reach levels comparable to Nano Banana and GPT-5 on interleaved generation benchmarks.
Base models show substantial gains on reasoning benchmarks such as WISE and RISE when using 4-step FLUX.2-klein.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same planner-critic structure could be tested on sequential generation tasks beyond images, such as video frame sequences.
Single-step RL with these rewards may lower the compute barrier for applying agentic oversight to even longer generation chains.
Agentic correction layers might serve as a general way to add capabilities missing from base generator architectures without retraining them.

Load-bearing premise

The accuracy and step-wise rewards used in single-step RL will steer the full multi-step interleaved trajectory without the critic introducing compounding errors over 25 or more generator calls.
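
A back-of-the-envelope calculation shows why compounding matters at this trajectory length. The sketch below assumes every generator call succeeds independently with the same probability and ignores the critic's retries. That is a toy model chosen for illustration, not the paper's analysis, and `trajectory_success` is a hypothetical helper.

```python
def trajectory_success(p_step: float, n_calls: int) -> float:
    """Probability that all n_calls succeed, assuming independent calls."""
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be a probability")
    return p_step ** n_calls


# Illustrative per-call success rates; none of these come from the paper.
for p in (0.99, 0.98, 0.95):
    print(f"p_step={p:.2f} -> 25-call success = {trajectory_success(p, 25):.3f}")
```

Under this simplification, small per-call error rates still shrink the chance of a fully clean 25-call trajectory noticeably, which is why evidence on per-step critic reliability matters.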

What would settle it

Measure output quality on an interleaved generation benchmark that requires trajectories longer than 25 generator calls; if quality shows no gain or degrades relative to the base generator, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2606.13679 by Dian Zheng, Harry Lee, Hongsheng Li, Kaituo Feng, Manyuan Zhang, Ray Zhang, Zoey Guo.

**Figure 1.** Capabilities of InterleaveThinker, consisting of interleaved generation with various types of inputs, real-world action interaction, and robotic manipulation. Gray: inputs, blue: outputs.

**Figure 2.** Problems in image generators and UMMs for interleaved generation, highlighted in red boxes.

Adjacent text from the paper: "Specifically, we observe substantial improvements on the WISE benchmark (increasing from 0.47 to 0.73) and the RISE benchmark (leaping from 13.3 to 28.9). These results highlight the immense potential of multi-agent collaboration in unlocking complex, sequential reasoning and generation capab…"

**Figure 3.** Overview of InterleaveThinker. t denotes the number of refinement iterations.

**Figure 4.** The working flow of InterleaveThinker. Note: For the initial step i=1, I_0 is set as a blank white image to maintain input consistency. This generation-evaluation loop (Stage 2↔3) iterates until a positive execution judgment (True) is obtained, or a maximum number of iterations T_max is reached. Upon satisfaction, the pipeline finalizes I_i and a_i, appends them to the output sequence, and proceeds to step i …

**Figure 5.** Illustration of our data construction pipeline.

**Figure 6.** Comparison with Emu3.5 and Nano Banana Pro in pure-text input interleaved generation.

**Figure 7.** Comparison with Emu3.5 and Nano Banana Pro in multi-modal input interleaved generation.

**Figure 8.** Failure case of InterleaveThinker+FLUX.2-klein.
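
The generation-evaluation loop described in the Figure 4 caption can be sketched in Python as follows. This is an illustrative reading of the caption, not the authors' implementation: `run_interleaved`, `Verdict`, the toy generator and critic, and the choice to feed each finalized image into the next step are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = str  # stand-in type; a real pipeline would pass image tensors or files


@dataclass
class Verdict:
    ok: bool                  # execution judgment (True / False)
    refined_instruction: str  # corrected instruction used for regeneration
    text: str                 # accompanying text a_i for this step


def run_interleaved(
    plan: List[str],
    generate: Callable[[Image, str], Image],
    critique: Callable[[str, Image], Verdict],
    t_max: int = 3,
    blank: Image = "<blank white image>",
) -> List[Tuple[Image, str]]:
    """Run each planned step, regenerating until the critic accepts or t_max is hit."""
    if t_max < 1:
        raise ValueError("t_max must be at least 1")
    outputs: List[Tuple[Image, str]] = []
    prev = blank  # I_0: blank image so every step has an image input
    for instruction in plan:
        current = instruction
        for _ in range(t_max):
            image = generate(prev, current)
            verdict = critique(instruction, image)  # judged against the plan
            if verdict.ok:
                break
            current = verdict.refined_instruction
        outputs.append((image, verdict.text))  # finalize I_i and a_i
        prev = image  # assumption: the finalized image feeds the next step
    return outputs


# Toy stand-ins: the critic rejects the first attempt and accepts the retry.
def toy_generate(prev: Image, instruction: str) -> Image:
    return f"image<{instruction}>"


def toy_critique(planned: str, image: Image) -> Verdict:
    return Verdict(
        ok="(refined)" in image,
        refined_instruction=f"{planned} (refined)",
        text=f"Caption for: {planned}",
    )


result = run_interleaved(["draw a pot", "add soil"], toy_generate, toy_critique)
for image, text in result:
    print(image, "|", text)
```

With the toy critic, each step takes two generator calls. A real run multiplies the number of planned steps by the retries per step, which is how a single trajectory can exceed 25 generator calls.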

Original abstract

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

InterleaveThinker adds a planner-critic layer to give base image generators interleaved text-image output via new SFT datasets and single-step GRPO, but the reward design for long trajectories lacks supporting checks.

The main point is a planner that lays out the interleaved sequence and a critic that spots and fixes bad generator steps, trained first with SFT on the new Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k sets, then reinforced with GRPO on Interleave-Critic-RL-13k using accuracy and step-wise rewards.

This is new in the specific multi-agent framing and the dataset construction for this exact task. The modular approach is useful because it avoids retraining the base generator and directly targets the gap in current UMMs for text-image sequences.

The soft spot is the RL shortcut. The paper notes that full-trajectory optimization is impractical with 25-plus calls, so single-step GRPO is meant to steer the whole run. That only holds if the critic stays stable and the rewards prevent drift; nothing in the abstract points to ablations or error tracking that would confirm it works. The stated gains on WISE and RISE, and the parity with Nano Banana and GPT-5, are therefore hard to weigh without the tables or controls.

This is for people building agentic multimodal tools who need a practical extension method. The thinking is straightforward and the problem is real, so the paper deserves a serious referee to inspect the experiments and the reward stability claim.

Send it to review.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces InterleaveThinker, a multi-agent pipeline consisting of a planner agent that organizes text-image sequences for any base image generator and a critic agent that evaluates outputs, detects deviations, and refines instructions for regeneration. It constructs Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k for format cold-start SFT, then Interleave-Critic-RL-13k for GRPO-based reinforcement of the critic using proposed accuracy and step-wise rewards. The design addresses the impracticality of full-trajectory optimization for sequences with over 25 generator calls, claiming performance gains across generators, comparability to Nano Banana and GPT-5 on interleaved benchmarks, and substantial improvements on WISE and RISE for 4-step FLUX.2-klein.

Significance. If the results hold under rigorous verification, the work would be significant for demonstrating a practical agentic method to retrofit interleaved generation onto existing image generators without retraining their weights. The explicit construction of large-scale SFT datasets and the reward formulation to enable single-step RL approximation of multi-step trajectories are concrete contributions that could be adopted or extended in vision-language agent research.

major comments (1)

[Abstract] The central claim that the accuracy reward and step-wise reward 'allow single-step RL to effectively guide the entire generation trajectory' is load-bearing for the reported benchmark comparability and WISE/RISE gains, yet no ablation, error accumulation analysis, or stability check over 25+ steps is referenced to confirm that critic corrections do not introduce compounding deviations; this assumption directly determines whether the single-step GRPO design supports the full pipeline results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract's central claim. We address the concern point by point below and commit to strengthening the manuscript accordingly.

Referee: [Abstract] The central claim that the accuracy reward and step-wise reward 'allow single-step RL to effectively guide the entire generation trajectory' is load-bearing for the reported benchmark comparability and WISE/RISE gains, yet no ablation, error accumulation analysis, or stability check over 25+ steps is referenced to confirm that critic corrections do not introduce compounding deviations; this assumption directly determines whether the single-step GRPO design supports the full pipeline results.

Authors: We agree that the absence of explicit ablations on error accumulation and long-horizon stability leaves the claim under-supported in the current manuscript. The step-wise reward is formulated to deliver immediate per-step feedback that corrects deviations before propagation, while the accuracy reward anchors overall trajectory fidelity; the observed gains on WISE/RISE for 4-step FLUX.2-klein and parity with Nano Banana/GPT-5 on interleaved benchmarks provide indirect empirical corroboration. Nevertheless, to directly validate that critic interventions do not compound over 25+ generator calls, we will add a dedicated analysis section (including ablation of the step-wise reward, per-step deviation tracking, and stability metrics on extended trajectories) in the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external benchmarks and standard RL techniques

The paper presents a multi-agent pipeline (planner + critic) trained via SFT on constructed datasets followed by GRPO RL with accuracy and step-wise rewards. No equations, fitted parameters renamed as predictions, or self-citation chains reduce any central claim to its own inputs by construction. Results are evaluated against external benchmarks (WISE, RISE) and compared to independent models (Nano Banana, GPT-5). The design choice to use single-step RL for multi-step trajectories is an explicit engineering assumption, not a self-referential derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the critic's step-wise correction and the two custom rewards guiding long trajectories; these are introduced without independent verification in the abstract.

free parameters (2)

accuracy reward and step-wise reward formulation
Ad-hoc rewards designed to enable single-step RL for full trajectories; their exact weighting and definition are not derived from first principles.
dataset sizes and construction rules for Interleave-Planner-SFT-80k etc.
Synthetic data generation process involves choices that affect training.

axioms (1)

Domain assumption: The critic agent can reliably detect and correct deviations from the planner's instructions in complex interleaved sequences.
This is required for the pipeline to improve rather than degrade generator output.


[59]

**Dynamic Step Count (Image Operations Only)**: Determine the necessary number of steps. Every step in your execution plan MUST represent an actual image generation or image editing action. **DO NOT** create separate steps solely for generating text, captions, or summaries
[60]

**Complete & Polished Output**: Always aim for a fully realized final product. For visual or creative tasks, the final step MUST result in a fully colored, detailed, and polished output. Do not stop at a draft, outline, or uncolored sketch unless the user explicitly requests it
[61]

**Text Generation & Auxiliary Text Rule**:
- If the user specifically asks to render or draw text *inside* the image, include this requirement within the `instruction` field.
- If the user explicitly asks for a *separate* text response (e.g., a caption, summary, explanation, or knowledge grounding) to accompany the image, generate this text and place it in...
[62]

**Prompt Optimization for All Steps**: Convert the `instruction` of EVERY step into a highly effective prompt in the `prompt` field.
- **Step 1 (Generation)**: Create a highly detailed T2I prompt representing the foundational stage. Focus *only* on the Step 1 instruction. Do NOT hallucinate unmentioned details or future elements.
- **Subsequent Steps (Editin...
[63]

**CRITICAL**: The `prompt` field MUST contain ONLY the pure text prompt or editing instruction. DO NOT include meta-text, prefixes (such as "Step 1:", "Prompt:", "Edit:"), or conversational filler. It must be directly usable by the generation/editing API.

## Output

The output consists of two parts:
[65]

A JSON -- Planning each step and rewrite the instruction to prompt suitable for generation/editing. Here is an output example
<think> Part 1: Planning analysis explaining the execution plan. Part 2: Analysis of how the instructions were translated into visual keywords for the T2I prompt and editing instructions. </think>
<answer> { 'execution_plan': [ {'ste...
[66]

**Task Identification & Modality Routing**: Carefully analyze the input to determine the task type.
- **Task A (General Text Response / Problem Solving / Image-to-Text)**: If the user provides a complete sequence of images and asks for text responses for each step (e.g., describing the images, solving a problem, explaining a process, or answering question...
[67]

**Strict Step Count & NO Prefix Rule**:
- **Step Count**: Determine the logical number of steps. **CRITICAL**: If the user's input explicitly specifies the number of steps required, you MUST strictly output exactly that number of steps to fulfill the requirement. If continuing a sequence (Task B), your `step_number` MUST start exactly from where the user's ...
[68]

**Field Definitions & Usage**:
- `instruction`: The detailed, pure text content or action for the editing step (Task B). You MUST set this to `null` for Task A. (Strictly NO step prefixes).
- `prompt`: The optimized, pure instruction suitable for the **image editing model** to execute the change based on the previous image (Task B). You MUST set this to `null...
[69]

**Complete Output**: Ensure the final step achieves a complete resolution of the user's goal based on the sequence context.

## Output

The output consists of two parts:
[70]

A Statement - Just a dummy reasoning
[71]

A JSON -- Planning each step and rewrite the instruction to prompt suitable for generation/editing. Here is an output example
<think> </think>
<answer> { 'execution_plan': [ {'step_number': i, 'step_name': 'Short name for the step', 'instruction': "Detailed instruction for this step (Task B). Output null if this is Task A. Strictly NO prefixes like 'Step i:...
[72]

Evaluate the edited image and output the result in boolean format (True/False)
[73]

If you think the edited image is not good enough (False), generate an optimized rewritten prompt that addresses the original shortcomings; if you think it is good enough (True), output the [Original Rewritten Prompt].

## Input Information

You have been presented with two images in sequence:
- Original Image: The input image before editing. (NOTE: For the ...
[74]

**Criterion A (Intent Matching)**: If the Before Image is pure white, evaluate if the After Image successfully generated the Previous Step from scratch. Otherwise, observe the delta (differences). Did the changes match the key meaning and necessary details of the Previous Step?
[75]

**Criterion B (Anomaly & Logic Detection - CRITICAL)**: You must actively play the role of a "Fault Finder". Do NOT just check if the requested object exists; you MUST check HOW it exists. Scan the After Image for any of the following fatal errors:
- **Anatomical/Biological Errors**: Extra/missing limbs or fingers, body parts emerging from impossible or a...
[76]

**What went wrong?**
- Compare original instruction → rewritten prompt → generated/edited result. *(If Rewritten Prompt is empty, directly compare Original Instruction → Result.)*
- Identify gaps between intent and execution
- Determine if the issue is clarity, specificity, or contradiction
[77]

**Refinement Approaches:**

**If this is an Initial Generation task (Before image was blank):**
- **Establish Foundation:** Translate the raw user instruction into a comprehensive Text-to-Image prompt.
- **Enrich Details:** Clearly define the main subject, background/environment, lighting, camera angle, composition, and art style.
- **Prevent Ambiguity:** ...
[78]

**Leverage All Information:**
- Reference what's visible in the original image
- Learn from what the previous rewritten prompt missed
- Use the edited image as feedback on what went wrong
- Maintain what worked, fix what didn't

## Output

The output consists of three parts:
[79]

A Statement - Analysis process and reasoning
[80]

A Boolean - Judge whether the edited image is good enough
[81]

A prompt -- either the optimized rewritten prompt or the original rewritten prompt. Here is an output example:
<think> Detailed explanation of evaluation and new rewritten prompt. If edited image is good enough, explain why it meets requirements. If not good enough, explain specific shortcomings. </think>
<answer> { 'previous_step_success': 'boolean (True ...
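
The output examples in the prompts above wrap the reasoning in `<think>` tags and a dictionary in `<answer>` tags. As a rough illustration only, the Python sketch below shows one way a caller might parse such replies and check the "no prefix" rule. It is not the authors' code: the function names (`parse_reply`, `check_plan`, `step_succeeded`) and the prefix pattern are assumptions, and only keys that are visible in the prompts (`execution_plan`, `step_number`, `instruction`, `prompt`, `previous_step_success`) are used.

```python
import ast
import json
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
# Assumed pattern for the prefixes the prompts forbid ("Step 1:", "(3)", "Prompt:", "Edit:").
PREFIX_RE = re.compile(r"^\s*(?:step\s*\d+\s*:|\(\d+\)|prompt\s*:|edit\s*:)", re.IGNORECASE)


def parse_reply(text):
    """Split a model reply into (reasoning, answer_dict)."""
    match = ANSWER_RE.search(text)
    if match is None:
        raise ValueError("reply has no <answer>...</answer> block")
    body = match.group(1).strip()
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # The prompt examples use single quotes, which strict JSON rejects,
        # so fall back to parsing the body as a Python literal.
        try:
            data = ast.literal_eval(body)
        except (ValueError, SyntaxError) as exc:
            raise ValueError("answer is neither JSON nor a Python literal") from exc
    if not isinstance(data, dict):
        raise ValueError("answer is not a dictionary")
    think = THINK_RE.search(text)
    return (think.group(1).strip() if think else ""), data


def check_plan(data):
    """Return a list of rule violations found in a planner answer."""
    steps = data.get("execution_plan")
    if not isinstance(steps, list) or not steps:
        return ["execution_plan is missing or empty"]
    problems = []
    for index, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {index + 1}: not a dictionary")
            continue
        label = step.get("step_number", index + 1)
        for field in ("instruction", "prompt"):
            value = step.get(field)
            # Null fields are allowed (Task A), so only strings are checked.
            if isinstance(value, str) and PREFIX_RE.match(value):
                problems.append(f"step {label}: {field} starts with a forbidden prefix")
    return problems


def step_succeeded(data):
    """Read the critic's verdict, accepting a bool or a 'True'/'False' string."""
    value = data.get("previous_step_success")
    if isinstance(value, bool):
        return value
    if isinstance(value, str) and value.strip().lower() in ("true", "false"):
        return value.strip().lower() == "true"
    raise ValueError("previous_step_success is not a boolean")
```

In a pipeline of this shape, a caller would presumably regenerate a step when `step_succeeded` returns `False`, and reject or repair a plan when `check_plan` reports violations. How the actual system handles these cases is not stated in the text above.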
