arxiv: 2603.28767 · v3 · pith:66ZBWGSAnew · submitted 2026-03-30 · 💻 cs.CV

Gen-Searcher: Reinforcing Agentic Search for Image Generation

Pith reviewed 2026-05-25 06:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords image generation · search agent · multi-hop reasoning · reinforcement learning · knowledge grounding · agentic search · dual reward

The pith

An image generation agent trained to perform multi-hop search for knowledge and reference images achieves substantial performance gains on knowledge-intensive tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that image generation models can be augmented with an agent capable of multi-hop reasoning and external search to overcome the limitations of frozen internal knowledge. It does this by building specialized datasets for SFT and RL, then applying a dual-reward system in GRPO training that uses both text and image feedback. A new benchmark, KnowGen, is introduced to test models on prompts that require search-grounded external knowledge. Sympathetic readers would care because this enables accurate image generation for real-world scenarios involving up-to-date or specialized information. Experiments report that Gen-Searcher improves Qwen-Image by around 16 points on KnowGen and around 15 points on WISE.

Core claim

Gen-Searcher is presented as the first search-augmented image generation agent that performs multi-hop reasoning and search to collect textual knowledge and reference images needed for grounded generation. Two datasets are curated, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, and the model is trained with SFT followed by agentic reinforcement learning using dual reward feedback in GRPO training. This yields substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE.

What carries the argument

The dual-reward GRPO training combining text-based and image-based rewards to provide stable and informative learning signals for the agent performing multi-hop search.
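As a rough illustration of how such a signal could be assembled, the sketch below blends a text-based and an image-based score with a weight `alpha`, then standardizes the blended rewards within one group of rollouts, which is the group-relative advantage that GRPO is known for. The linear blend, the role given to `alpha`, the function names, and the example scores are all assumptions made for illustration; this summary does not state the paper's exact reward formula.

```python
from statistics import mean, pstdev

def dual_reward(text_score: float, image_score: float, alpha: float = 0.5) -> float:
    """Blend two rewards in [0, 1]. The linear form is an illustrative assumption."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * text_score + (1.0 - alpha) * image_score

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize each reward within its rollout group.

    Population standard deviation is used here; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical (text_score, image_score) pairs for four rollouts of one prompt.
rollouts = [(0.9, 0.6), (0.4, 0.5), (0.7, 0.9), (0.2, 0.1)]
rewards = [dual_reward(t, i, alpha=0.5) for t, i in rollouts]
advantages = group_relative_advantages(rewards)
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason a second, complementary reward signal might make learning more informative.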

If this is right

  • Image generators can handle knowledge-intensive or time-sensitive prompts by dynamically searching for information.
  • The approach provides a foundation for open development of search agents in visual generation tasks.
  • Dual reward feedback enables more effective reinforcement learning for agentic behaviors in generation.
  • New benchmarks like KnowGen allow systematic evaluation of search-grounded image generation capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integrating live web search could allow the agent to generate images based on the most current events.
  • The method may generalize to other modalities such as video generation requiring external references.
  • Overfitting risks could be mitigated by expanding the diversity of training prompts beyond the curated sets.

Load-bearing premise

The specific datasets and dual-reward GRPO training lead to genuine multi-hop search behavior and grounded generation rather than overfitting to the training prompts and images.

What would settle it

Evaluating the trained agent on a held-out set of prompts that require information absent from the training datasets and checking if it performs accurate searches and generates correct images.

Figures

Figures reproduced from arXiv: 2603.28767 by Chenyang Wang, Dian Zheng, Hongyu Li, Kaituo Feng, Kaixuan Fan, Manyuan Zhang, Shuang Chen, Xiangyu Yue, Yilei Jiang, Yunlong Lin.

Figure 1: Generated images using our proposed Gen-Searcher.
Figure 2: Our proposed Gen-Searcher enables search-grounded generation in real-world knowledge-intensive scenarios.
Figure 3: An illustration of our data curation pipeline.
Figure 4: Overview of the KnowGen benchmark.

    Evaluation Metric. To evaluate generation quality on KnowGen, we introduce K-Score, a metric designed to assess search-grounded image generation from multiple perspectives. We adopt GPT-4.1 [35] as the judge to evaluate model outputs, following WISE benchmark [16]. For each sample, the evaluator takes as input the original text prompt, the ground-truth reference image, an…
Figure 5: An inference example of Gen-Searcher.

    Search Tools. Gen-Searcher is equipped with three search tools. The first is search, which performs web text search and returns the top-k relevant webpage URLs for each query with their short snippets. This tool is mainly used to verify factual information such as entity names, event details, dates, locations, and concise descriptions. The second is image_search, which…
Figure 6: Examples of generated images by different methods on our KnowGen benchmark.
Figure 7: Parameter Analysis on α.

    5 Conclusion. In this paper, we present Gen-Searcher, the first attempt to train a multimodal deep search agent for knowledge-intensive image generation with agentic RL. To enable this setting, we build a dedicated data pipeline, construct two training datasets Gen-Searcher-SFT-10k, Gen-Searcher-RL-6k, and introduce the KnowGen benchmark together with K-Score for evaluating real-wor…
Original abstract

Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, thus often failing on real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, as the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Gen-Searcher as the first search-augmented image generation agent that performs multi-hop reasoning and external search to collect textual knowledge and reference images for grounded generation. The authors construct a tailored data pipeline to curate Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k datasets containing search-intensive prompts and ground-truth synthesis images, introduce the KnowGen benchmark for evaluating search-grounded image generation across multiple dimensions, and train the model with SFT followed by agentic reinforcement learning using dual-reward GRPO that combines text-based and image-based signals. Experiments report substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE.

Significance. If the reported gains reflect genuine acquisition of multi-hop search behavior that generalizes beyond the training distribution, the work would be significant as the first open foundation for agentic search in image generation. The open-sourcing of the datasets, models, and code is a concrete strength that enables reproducibility and follow-on research.

major comments (2)
  1. [Abstract / Data Construction] Abstract and data pipeline description: both the training sets (Gen-Searcher-SFT-10k, RL-6k) and the KnowGen benchmark are generated by the same 'tailored data pipeline' that produces 'search-intensive prompts and corresponding ground-truth synthesis images,' yet no quantitative checks (embedding similarity, n-gram overlap, or deduplication) are reported. This directly undermines the central claim of 16-point gains on KnowGen, because the dual-reward GRPO objective could improve scores by fitting the specific synthesis style and prompt patterns rather than learning robust external search.
  2. [Experiments] Experiments section: the abstract states clear numerical gains but supplies no information on baseline controls, statistical significance, data leakage verification, or ablation of the dual-reward component and the GRPO stage. Without these, the attribution of the 16- and 15-point improvements specifically to the agentic search pipeline remains only partially supported.
minor comments (1)
  1. [Abstract] The abstract refers to 'around 16 points' and '15 points' on KnowGen and WISE but does not name the underlying evaluation metrics or scoring protocol.
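The overlap checks requested in major comment 1 could start from something as simple as the sketch below, which computes word-trigram Jaccard similarity between benchmark prompts and training prompts and flags pairs above a threshold. It covers only the n-gram part of the request, not embedding similarity. The function names, the 0.5 threshold, and the example prompts are hypothetical and are not drawn from Gen-Searcher-SFT-10k, Gen-Searcher-RL-6k, or KnowGen.

```python
import re

def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Lowercase, keep alphanumeric tokens, and return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def flag_near_duplicates(train_prompts, bench_prompts, n=3, threshold=0.5):
    """Return (bench_index, train_index, similarity) for pairs at or above threshold."""
    train_grams = [word_ngrams(p, n) for p in train_prompts]
    flagged = []
    for bi, bench in enumerate(bench_prompts):
        bench_grams = word_ngrams(bench, n)
        for ti, grams in enumerate(train_grams):
            sim = jaccard(bench_grams, grams)
            if sim >= threshold:
                flagged.append((bi, ti, sim))
    return flagged

# Made-up prompts for illustration only.
train = ["A poster of the 2024 championship final at the city stadium",
         "A watercolor of a lighthouse at dusk"]
bench = ["A poster of the 2024 championship final at the city stadium at night",
         "An oil painting of a mountain village in winter"]
flagged = flag_near_duplicates(train, bench)
```

The all-pairs loop is quadratic in the number of prompts. That is workable at the scale of roughly ten thousand training prompts, but a larger corpus would likely need hashing or an index.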

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The concerns about potential data overlap and the need for stronger experimental controls are valid and help improve the clarity of our claims. Below we respond point-by-point to the major comments.

  1. Referee: [Abstract / Data Construction] Abstract and data pipeline description: both the training sets (Gen-Searcher-SFT-10k, RL-6k) and the KnowGen benchmark are generated by the same 'tailored data pipeline' that produces 'search-intensive prompts and corresponding ground-truth synthesis images,' yet no quantitative checks (embedding similarity, n-gram overlap, or deduplication) are reported. This directly undermines the central claim of 16-point gains on KnowGen, because the dual-reward GRPO objective could improve scores by fitting the specific synthesis style and prompt patterns rather than learning robust external search.

    Authors: We agree that the absence of quantitative overlap checks is a limitation in the current manuscript. The tailored pipeline was designed to produce diverse, search-intensive prompts drawn from varied knowledge domains, with manual curation steps intended to promote variety. Nevertheless, without reported metrics it is difficult to fully rule out style or pattern fitting. In the revision we will add embedding similarity (e.g., cosine similarity on sentence embeddings), n-gram overlap statistics, and explicit deduplication results between the SFT/RL training sets and KnowGen. These analyses will be placed in a new subsection of the data construction section. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract states clear numerical gains but supplies no information on baseline controls, statistical significance, data leakage verification, or ablation of the dual-reward component and the GRPO stage. Without these, the attribution of the 16- and 15-point improvements specifically to the agentic search pipeline remains only partially supported.

    Authors: The manuscript reports gains relative to Qwen-Image on KnowGen and WISE, but we acknowledge that details on statistical significance, explicit leakage verification, and component ablations are missing. In the revised version we will (1) add statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals) for the reported improvements, (2) incorporate the overlap and leakage checks described above, and (3) include ablation studies that isolate the contribution of the dual-reward signals versus the GRPO stage alone. These additions will be presented in an expanded Experiments section with a dedicated ablation table. revision: yes
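The bootstrap confidence intervals mentioned in the second response can be sketched as follows: resample per-prompt score differences with replacement and read the interval off the percentiles of the resampled means. This is a minimal percentile bootstrap under the assumption that both systems are scored on the same prompts. The function name, the number of resamples, and the example scores are hypothetical and do not come from the paper.

```python
import random
from statistics import mean

def paired_bootstrap_ci(baseline, treated, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the mean per-prompt difference (treated - baseline)."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    resampled_means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lo = resampled_means[int(tail * n_resamples)]
    hi = resampled_means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return mean(diffs), lo, hi

# Hypothetical per-prompt scores for a base model and a search-augmented one.
base_scores = [0.30, 0.45, 0.20, 0.55, 0.40, 0.35]
agent_scores = [0.50, 0.60, 0.35, 0.65, 0.60, 0.45]
observed, ci_low, ci_high = paired_bootstrap_ci(base_scores, agent_scores)
```

Pairing by prompt removes the variance that comes from prompt difficulty. An interval that excludes zero would support the claim that the gain is not a resampling artifact, though it says nothing about train/benchmark leakage.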

Circularity Check

0 steps flagged

No circularity; empirical pipeline with external benchmarks.

The paper describes constructing a data pipeline to curate Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, training via SFT then dual-reward GRPO, and reporting empirical gains on the separately introduced KnowGen benchmark plus WISE. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All load-bearing claims are performance deltas on held-out benchmarks, which remain falsifiable outside the training distribution. Potential distributional overlap between pipeline-generated sets is a data-validity issue, not a reduction of any derivation to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the work rests on standard supervised and reinforcement learning assumptions common to the field.

pith-pipeline@v0.9.0 · 5811 in / 1194 out tokens · 35678 ms · 2026-05-25T06:28:48.045875+00:00 · methodology

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    GenEvolve proposes a self-evolving agent framework for open-ended image generation that uses tool-orchestrated trajectories and visual experience distillation from best-worst differences to achieve reported state-of-t...

  2. Aurora: Unified Video Editing with a Tool-Using Agent

    cs.CV 2026-05 unverdicted novelty 7.0

    Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.

  3. From Web to Pixels: Bringing Agentic Search into Visual Perception

    cs.CV 2026-05 unverdicted novelty 7.0

    WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

  4. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 conditional novelty 7.0

    Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...

  5. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to Flow Matching models through specialized teachers, cold-start initialization, task routing, and manifold regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 o...

  6. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.

  7. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.

  8. SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.

  9. GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    GenEvolve introduces a self-evolving agent framework for image generation using tool-orchestrated trajectories and Visual Experience Distillation to achieve claimed SOTA results on benchmarks.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 5 Pith papers · 19 internal anchors

  1. [1]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025

  2. [2]

    Gemini image pro: High-quality image generation

    Google DeepMind. Gemini image pro: High-quality image generation. https://deepmind.google/models/gemini-image/pro/, 2025

  3. [3]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025

  4. [4]

    Re-imagen: Retrieval-augmented text-to-image generator

    Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022

  5. [5]

    Retrieval-augmented diffusion models

    Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems, 35:15309–15324, 2022

  6. [6]

    M2io-r1: An efficient rl-enhanced reasoning framework for multimodal retrieval augmented multimodal generation

    Zhiyou Xiao, Qinhan Yu, Binghui Li, Geng Chen, Chong Chen, and Wentao Zhang. M2io-r1: An efficient rl-enhanced reasoning framework for multimodal retrieval augmented multimodal generation. arXiv preprint arXiv:2508.06328, 2025

  7. [7]

    Ia-t2i: Internet-augmented text-to-image generation

    Chuanhao Li, Jianwen Sun, Yukang Feng, Mingliang Zhai, Yifan Chang, and Kaipeng Zhang. Ia-t2i: Internet-augmented text-to-image generation. arXiv preprint arXiv:2505.15779, 2025

  8. [8]

    Mind-brush: Integrating agentic cognitive search and reasoning into image generation

    Jun He, Junyan Ye, Zilong Huang, Dongzhi Jiang, Chenjue Zhang, Leqi Zhu, Renrui Zhang, Xiang Zhang, and Weijia Li. Mind-brush: Integrating agentic cognitive search and reasoning into image generation. arXiv preprint arXiv:2602.01756, 2026

  9. [9]

    WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

    Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent. arXiv preprint arXiv:2508.05748, 2025

  10. [10]

    Exploring Reasoning Reward Model for Agents

    Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, and Xiangyu Yue. Exploring reasoning reward model for agents. arXiv preprint arXiv:2601.22154, 2026

  11. [11]

    Gemini 3 pro

    Google DeepMind. Gemini 3 pro. https://deepmind.google/models/gemini/pro/, 2025

  12. [12]

    Seed1.8 model card: Towards generalized real-world agency

    Bytedance Seed. Seed1.8 model card: Towards generalized real-world agency. https://seed.bytedance.com/en/seed1_8, 2025.

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  14. [14]

    OneThinker: All-in-one Reasoning Model for Image and Video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video. arXiv preprint arXiv:2512.03043, 2025

  15. [15]

    Seedream 4.5

    Bytedance Seed. Seedream 4.5. https://seed.bytedance.com/en/seedream4_5, 2025

  16. [16]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265, 2025

  17. [17]

    Editthinker: Unlocking iterative reasoning for any image editor

    Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, et al. Editthinker: Unlocking iterative reasoning for any image editor. arXiv preprint arXiv:2512.05965, 2025

  18. [18]

    AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model

    Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, et al. Architecture decoupling is not all you need for unified multimodal model. arXiv preprint arXiv:2511.22663, 2025

  19. [19]

    Comprehensive exploration of diffusion models in image generation: a survey

    Hang Chen, Qian Xiang, Jiaxin Hu, Meilin Ye, Chao Yu, Hao Cheng, and Lei Zhang. Comprehensive exploration of diffusion models in image generation: a survey. Artificial Intelligence Review, 58(4):99, 2025

  20. [20]

    Stable diffusion 3.5 large

    Stability AI. Stable diffusion 3.5 large. https://huggingface.co/stabilityai/stable-diffusion-3.5-large, 2024

  21. [21]

    Imagen

    Google DeepMind. Imagen. https://deepmind.google/models/imagen/, 2025

  22. [22]

    Flux 1

    black-forest labs. Flux 1. https://github.com/black-forest-labs/flux, 2024

  23. [23]

    LongCat-Image Technical Report

    Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. Longcat-image technical report. arXiv preprint arXiv:2512.07584, 2025

  24. [24]

    Agentic Reinforced Policy Optimization

    Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization. arXiv preprint arXiv:2507.19849, 2025

  25. [25]

    AdaTooler-V: Adaptive Tool-Use for Images and Videos

    Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, et al. Adatooler-v: Adaptive tool-use for images and videos. arXiv preprint arXiv:2512.16918, 2025

  26. [26]

    Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965, 2025

  27. [27]

    Critique-grpo: Advancing llm reasoning with natural language and numerical feedback

    Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, and Helen Meng. Critique-grpo: Advancing llm reasoning with natural language and numerical feedback. arXiv preprint arXiv:2506.03106, 2025

  28. [28]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025

  29. [29]

    Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning

    Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning. arXiv preprint arXiv:2506.04207, 2025

  30. [30]

    Sophiavl-r1: Reinforcing mllms reasoning with thinking reward

    Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward. arXiv preprint arXiv:2505.17018, 2025

  31. [31]

    Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping

    Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, and Nanyun Peng. Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping. arXiv preprint arXiv:2510.08457, 2025

  32. [32]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. arXiv preprint arXiv:2505.10978, 2025

  33. [33]

    Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models

    Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, et al. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models. arXiv preprint arXiv:2601.22060, 2026

  34. [34]

    Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis

    Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, et al. Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis. arXiv preprint arXiv:2505.16834, 2025

  35. [35]

    Introducing gpt-4.1 in the api

    OpenAI. Introducing gpt-4.1 in the api. https://openai.com/index/gpt-4-1/, 2025

  36. [36]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  37. [37]

    Gpt-image-1: Models and capabilities for image generation

    OpenAI. Gpt-image-1: Models and capabilities for image generation. https://platform.openai.com/docs/models/gpt-image-1, 2024

  38. [38]

    Gpt-image-1.5: Enhanced visual reasoning and creative generation

    OpenAI. Gpt-image-1.5: Enhanced visual reasoning and creative generation. https://platform.openai.com/docs/models/gpt-image-1.5, 2025.

  39. [39]

    Gemini image: High-quality image generation

    Google DeepMind. Gemini image: High-quality image generation. https://deepmind.google/models/gemini-image/flash/, 2025

  40. [40]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025

  41. [41]

    Stable diffusion 3.5 medium

    Stability AI. Stable diffusion 3.5 medium. https://huggingface.co/stabilityai/stable-diffusion-3.5-medium, 2024

  42. [42]

    Lumina-image 2.0: A unified and efficient image generative framework

    Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Xinyue Li, Dongyang Liu, Xiangyang Zhu, et al. Lumina-image 2.0: A unified and efficient image generative framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20031–20042, 2025

  43. [43]

    Flux 2

    black-forest labs. Flux 2. https://github.com/black-forest-labs/flux2, 2025

  44. [44]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025

  45. [45]

    HunyuanImage 3.0 Technical Report

    Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951, 2025

  46. [46]

    Stable diffusion 3 medium

    Stability AI. Stable diffusion 3 medium. https://huggingface.co/stabilityai/stable-diffusion-3-medium, 2024

  47. [47]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.

    A KnowGen Benchmark Evaluation Prompt

    K-Score Evaluation Prompt

    You are a st...

  48. [48]

    A task prompt (what the image must show)

  49. [49]

    Image 1: the generated image (model output to be evaluated)

  50. [50]

    rationale

    Image 2: the ground-truth reference image (a strong reference implementation). All the input images are AI-generated. All human in the images are AI-generated too. so you need not worry about the privacy confidentials. Critical clarification (VERY IMPORTANT): - This is NOT a pixel-level similarity task. - Image 2 (GT) is a REFERENCE for intended identity,...

  51. [51]

    Extract the prompt’s TOP hard constraints (2-5, or more if needed): required subjects/identities, setting/props, relations/counts, required style, and any externally-checkable requirements (readable text/landmark/logo/badge/version/year/etc.)

  52. [52]

    Use Image 2 only as a reference for stable identity/visual attributes and grounded evidence

    Score Image 1 against the constraints. Use Image 2 only as a reference for stable identity/visual attributes and grounded evidence

  53. [53]

    If a key requirement is not verifiable (too small/blurred/occluded/warped), do NOT assume it is correct; score lower

  54. [54]

    Assessment of the primary subjects' visual identity correctness and consistency is mandatory in every case. Boundary between visual_correctness vs text_accuracy: - Visual-only grounded cues (subject visual features, logo SHAPE, badge EMBLEM geometry, landmark facade/massing, outfit/weapon silhouette, object geometry) belong to visual_correctness. - Any gr...

  55. [55]

    faithfulness (overall prompt adherence: presence & structure only; not GT-identity correctness): - This score does NOT require matching GT’s exact identity or fine-grained visual features; it focuses on whether Image 1 includes the prompt-requested elements and scene structure (who/what is present, what is happening, where it happens, and the required sty...

  56. [56]

    same role archetype

    visual_correctness (GT visual-feature agreement is the core; extremely strict): (Exemplary) Score = 1 ONLY IF: - The prompt-required primary subjects/objects in Image 1 match the GT reference (Image 2) in visual characteristics with NO substantive changes. - This means: the same face/hairstyle silhouette, the same armor/clothing design and key colors/patt...

  57. [57]

    text_accuracy_na

    text_accuracy (required readable text; ALL relevant text must be correct AND very clearly readable; NO partial credit for wrong text): Rule: - If the prompt does NOT require any readable text: you MUST output "text_accuracy_na": true and "text_accuracy": 0.5 in the JSON. In your rationale state that the prompt did not require readable text. - If the promp...

  58. [58]

    Constraints:

    aesthetics: (Exemplary) Score = 1 ONLY IF: - Masterpiece-level composition and polish, AND Image 1 is NOT worse than GT in overall aesthetic quality. (Conditional) Score = 0.5 ONLY IF: - Very beautiful and polished, but slightly worse than GT (ONLY slightly) OR slightly less refined than top-tier. (Rejected) Score = 0 IF: - Anything clearly worse than GT ...

  59. [59]

    Task prompt: the original user requirement (what image we want to generate)

  60. [60]

    Ground-truth reference image: the target image we want the pipeline to produce

  61. [61]

    rationale

    Model's answer: the model's output in <answer>, containing: - gen_prompt: a natural-language prompt for an image generator (composition, style, subjects, etc.). - reference_images: a list of chosen reference images (each with img_id, title, note, etc.) that the model selected from search to guide generation. Your task (TEXT + VISUAL): - From both TEXT and...
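The K-Score rubric fragments above name four dimensions (faithfulness, visual_correctness, text_accuracy, aesthetics) and a `text_accuracy_na` flag for prompts that require no readable text. The sketch below shows one way a single judge response might be folded into one number. The equal default weights and the choice to drop a not-applicable dimension and renormalize are assumptions made for illustration; the extracted text does not say how the paper combines the dimensions, and the function name and example record are hypothetical.

```python
import json

DIMENSIONS = ("faithfulness", "visual_correctness", "text_accuracy", "aesthetics")

def aggregate_judgement(raw_json: str, weights=None) -> float:
    """Combine one judge response into a single score in [0, 1].

    Assumptions (not from the paper): equal weights by default, and a
    dimension flagged as not applicable is dropped and the rest renormalized.
    """
    record = json.loads(raw_json)
    weights = weights or {name: 1.0 for name in DIMENSIONS}
    total, weight_sum = 0.0, 0.0
    for name in DIMENSIONS:
        if name == "text_accuracy" and record.get("text_accuracy_na", False):
            continue  # the prompt required no readable text
        score = float(record[name])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {score}")
        total += weights[name] * score
        weight_sum += weights[name]
    return total / weight_sum

# Hypothetical judge output for a prompt that required no readable text.
example = json.dumps({
    "faithfulness": 1, "visual_correctness": 0.5,
    "text_accuracy": 0.5, "text_accuracy_na": True,
    "aesthetics": 0.5, "rationale": "placeholder",
})
score = aggregate_judgement(example)
```

Dropping the not-applicable dimension keeps the placeholder value of 0.5 from pulling every text-free prompt toward the middle of the scale. Averaging it in unchanged would be an equally defensible reading of the rubric.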