pith. machine review for the scientific record.

arxiv: 2604.25636 · v1 · submitted 2026-04-28 · 💻 cs.CV

Recognition: unknown

Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords: image refinement · unified multimodal models · text-to-image generation · semantic tokens · regeneration · editing paradigm · prompt alignment

The pith

Regenerating images from a prompt plus semantic tokens of the first output lets unified multimodal models fix misalignments more completely than editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Unified multimodal models handle both image understanding and image generation in one system. Their text-to-image outputs can be refined after the fact, but existing methods edit only the mismatched parts using coarse instructions while keeping most pixels fixed. This paper replaces that with regeneration: the model creates a fresh image from the target prompt together with semantic tokens taken from the initial image. The change removes the need for editing instructions and drops the strict preservation rule, giving the model a larger space in which to adjust content toward the prompt. Experiments show this raises alignment scores on Geneval, DPGBench, and UniGenBench++.

Core claim

The paper claims that reformulating refinement as conditional image regeneration, conditioned on the target prompt and semantic tokens of the initial image, enables more complete semantic alignment with a larger modification space than the refinement-via-editing paradigm, which relies on coarse editing instructions and pixel-level preservation.

What carries the argument

Refinement via Regeneration (RvR) is the mechanism that produces a new image from the target prompt and semantic tokens extracted from the initial image, replacing editing instructions and strict pixel preservation.
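Read mechanically, that loop is: extract semantic tokens once, then run an ordinary conditional generation from pure noise. A minimal sketch under that reading, assuming a rectified-flow UMM; the classes, dimensions, and function names here (ToyUMM, rvr_refine) are illustrative stand-ins, not the paper's actual interfaces:

```python
# Hypothetical sketch of RvR inference: regenerate from noise, conditioned on
# prompt tokens plus semantic tokens of the initial image. Nothing here
# enforces pixel-level preservation of the initial image.
import torch
import torch.nn as nn

class ToyUMM(nn.Module):
    """Stand-in for a unified multimodal model; all sizes are illustrative."""
    def __init__(self, dim=64):
        super().__init__()
        self.vit = nn.Linear(dim, dim)            # semantic-token extractor stub
        self.velocity = nn.Linear(3 * dim, dim)   # velocity head stub

    def semantic_tokens(self, image_feats):
        # Tokens summarizing the initial (possibly misaligned) image.
        return self.vit(image_feats)

    def predict_velocity(self, text_tok, sem_tok, noisy):
        # Condition on prompt tokens + semantic tokens + current noisy latents.
        return self.velocity(torch.cat([text_tok, sem_tok, noisy], dim=-1))

@torch.no_grad()
def rvr_refine(model, prompt_tokens, init_image_feats, steps=20):
    sem = model.semantic_tokens(init_image_feats)
    x = torch.randn_like(init_image_feats)   # start from pure noise, not pixels
    dt = 1.0 / steps
    for _ in range(steps):                    # simple Euler ODE integration
        x = x + dt * model.predict_velocity(prompt_tokens, sem, x)
    return x                                  # refined latents (decode with VAE)

model = ToyUMM()
refined = rvr_refine(model, torch.randn(1, 64), torch.randn(1, 64))
```

The contrast with editing sits on the `randn_like` line: generation restarts from noise, so the initial image influences the output only through its semantic tokens, never through a pixel-preservation constraint.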

If this is right

  • Unified multimodal models can reach higher prompt-image alignment on text-to-image tasks without being limited by coarse editing instructions.
  • Pixel-level preservation is unnecessary for effective refinement and can be dropped to enlarge the space of possible adjustments.
  • The performance upper bound for text-to-image generation inside unified models is extended by treating refinement as full regeneration.
  • Refinement no longer depends on the model first producing editing instructions that only partially describe misalignments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same regeneration logic could be tested on other generation tasks such as image editing or captioning where token-based conditioning might also expand flexibility.
  • If semantic tokens prove sufficient, hybrid systems could use regeneration for global fixes and editing for local touch-ups on the same model.
  • The approach implies that richer semantic extraction from initial outputs may be more valuable than stronger editing modules in future unified models.

Load-bearing premise

Semantic tokens taken from the initial image carry enough information to support high-quality regeneration while the extra freedom in changes produces net gains rather than new artifacts or lost detail.

What would settle it

A side-by-side comparison on the same set of prompts where regenerated images show lower visual quality, more artifacts, or greater loss of fine detail than images refined by the editing approach.
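One way to assemble such a comparison, using FID against a shared reference set as a rough proxy for visual quality; the protocol, the torchmetrics dependency, and the tiny dummy batches are our assumptions for illustration, not anything the paper specifies:

```python
# Hedged sketch: score visual quality of RvR vs. RvE refinements of the same
# prompts via FID against a common reference set (pip install torchmetrics).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def fid_against_reference(candidate: torch.Tensor, reference: torch.Tensor) -> float:
    """candidate/reference: uint8 image batches of shape [N, 3, H, W]."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(reference, real=True)
    fid.update(candidate, real=False)
    return float(fid.compute())

# Dummy stand-ins; a real study would use thousands of decoded images.
ref = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
rvr = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
rve = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

print("FID(RvR):", fid_against_reference(rvr, ref))
print("FID(RvE):", fid_against_reference(rve, ref))
```

FID alone would not catch fine-detail loss on individual images; a complete harness would pair it with per-image artifact and detail-preservation metrics.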

Figures

Figures reproduced from arXiv: 2604.25636 by Chunyu Wang, Gao Huang, Jiangshan Wang, Jiayi Guo, Linqing Wang, Qinglin Lu, Yang Yue, Zeyu Liu, Zhiyuan Zhao.

Figure 1
Figure 1: Refinement via Regeneration (RvR) largely improves text-to-image … view at source ↗
Figure 2
Figure 2: Qualitative examples before and after RvR refinement. view at source ↗
Figure 3
Figure 3: Comparison between (a) prior Refinement via Editing (RvE) and … view at source ↗
Figure 4
Figure 4: Data construction pipeline for RvR. Step 1: An LLM generates prompts based on randomly selected semantic dimensions. Step 2: Multiple T2I generators independently generate images. Step 3: A VLM evaluates prompt–image alignment and labels generated images as aligned or misaligned. Each final training sample is constructed as a triplet of ⟨misaligned image, aligned image, prompt⟩. … view at source ↗
Figure 5
Figure 5: Overview of RvR. During training (a), the unified multimodal model (UMM) takes text tokens, ViT tokens from a misaligned image, and noisy VAE tokens from an aligned image, and learns velocity prediction for denoising. During inference (b), conditioned on the system prompt and misaligned image, the UMM denoises the noise into refined VAE tokens, which are decoded into the final aligned image (a minimal training-step sketch follows this figure list). view at source ↗
Figure 6
Figure 6: Qualitative comparison with RvE methods. view at source ↗
Figure 7
Figure 7: Multi-round generation. RvR supports iterative refinement across multiple rounds. (a) Additional rounds can further correct misaligned semantics that remain unresolved after the first round. (b) When the semantics are already correctly aligned after the first round, another round preserves the correct results. … view at source ↗
Figure 8
Figure 8: Robustness to initial image semantics. (a) RvR reuses prompt-compatible semantics from the initial image to facilitate regeneration. (b) When the initial semantics conflict with the prompt, RvR discards them and regenerates new aligned content. … columns illustrate cases where the initial image contains semantics compatible with the prompts. For example, it is natural for a dog to lie on grass and reasonab… view at source ↗
Figure 9
Figure 9: Attention mask for RvR. We follow standard UMM training to apply causal attention for text tokens (prompts) and full attention for image tokens (ViT and VAE). view at source ↗
Figure 10
Figure 10: Refinement data examples used for RvR training. Unlike image editing, the misaligned and aligned images are independently generated by the UMM's T2I process. This independence removes unnecessary pixel-level constraints and encourages RvR to focus on semantic correction. … view at source ↗
Figure 11
Figure 11: Additional qualitative results further demonstrating the refinement performance of RvR, including examples with Chinese prompts to showcase the bilingual refinement capability. … view at source ↗
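For readers decoding Figure 5's training description, here is a minimal sketch of one rectified-flow velocity-prediction step under the stated conditioning (prompt tokens, ViT tokens of the misaligned image, noised VAE latents of the aligned image); the single linear layer standing in for the UMM backbone and all dimensions are illustrative assumptions:

```python
# Hypothetical sketch of the velocity-prediction objective from Figure 5.
import torch
import torch.nn as nn

dim = 64
backbone = nn.Linear(3 * dim, dim)  # stand-in for the UMM's velocity predictor

def training_step(text_tok, vit_tok_misaligned, vae_lat_aligned):
    noise = torch.randn_like(vae_lat_aligned)
    t = torch.rand(vae_lat_aligned.size(0), 1)       # flow time in [0, 1]
    x_t = (1 - t) * noise + t * vae_lat_aligned      # linear noise-to-data path
    target_v = vae_lat_aligned - noise               # rectified-flow velocity
    pred_v = backbone(torch.cat([text_tok, vit_tok_misaligned, x_t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()         # velocity regression loss

loss = training_step(torch.randn(4, dim), torch.randn(4, dim), torch.randn(4, dim))
loss.backward()
```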
read the original abstract

Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially extending the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, where UMMs produce editing instructions to modify misaligned regions while preserving aligned content. However, editing instructions often describe prompt-image misalignment only coarsely, leading to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates images conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment with a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, improving Geneval from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Refinement via Regeneration (RvR) as an alternative to the dominant refinement-via-editing (RvE) paradigm in unified multimodal models for text-to-image tasks. Rather than generating editing instructions and enforcing pixel-level preservation of aligned content, RvR regenerates the full image conditioned on the target prompt together with semantic tokens extracted from the initial image. The authors argue this enlarges the effective modification space and yields more complete semantic alignment. They report large empirical gains: Geneval rises from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.

Significance. If the reported gains prove robust and can be causally attributed to the regeneration paradigm rather than confounding factors, the work would offer a useful new direction for post-generation refinement in UMMs. The empirical deltas are sizable and the framing of “enlarging modification space” is conceptually clear. Credit is due for introducing a concrete alternative framework and for the scale of the benchmark improvements shown.

major comments (2)
  1. [Abstract] The central claim that conditioning on semantic tokens produces a meaningfully larger modification space (as opposed to merely guided reconstruction) is load-bearing yet unsupported by any ablation, diversity metric, or visualization comparing RvR outputs to the initial image or to RvE outputs. Without such evidence it remains possible that the tokens re-introduce layout constraints and that the observed benchmark gains arise from other variables.
  2. [Experiments] The benchmark deltas are presented without any description of experimental controls, number of runs, statistical significance tests, or failure-case analysis. This omission prevents assessment of whether the improvements are stable or sensitive to post-hoc choices such as inference steps or prompt re-weighting.
minor comments (1)
  1. [Abstract] A single sentence clarifying how semantic tokens are extracted (e.g., from which layer or encoder) would help readers evaluate the weakest assumption noted in the review.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript accordingly to strengthen the presentation of our claims and experimental details.

read point-by-point responses
  1. Referee: [Abstract] The central claim that conditioning on semantic tokens produces a meaningfully larger modification space (as opposed to merely guided reconstruction) is load-bearing yet unsupported by any ablation, diversity metric, or visualization comparing RvR outputs to the initial image or to RvE outputs. Without such evidence it remains possible that the tokens re-introduce layout constraints and that the observed benchmark gains arise from other variables.

    Authors: We agree that the manuscript would benefit from explicit empirical support for the enlarged modification space. Although the conceptual distinction between regeneration and editing is developed in the introduction and method sections, we will add a dedicated ablation subsection with diversity metrics (such as pairwise LPIPS distances and semantic variance measures) and side-by-side visualizations of RvR outputs versus both the initial image and RvE outputs. These additions will directly address whether semantic-token conditioning permits greater flexibility or inadvertently re-imposes layout constraints (a sketch of such a diversity probe follows this rebuttal). revision: yes

  2. Referee: [Experiments] The benchmark deltas are presented without any description of experimental controls, number of runs, statistical significance tests, or failure-case analysis. This omission prevents assessment of whether the improvements are stable or sensitive to post-hoc choices such as inference steps or prompt re-weighting.

    Authors: We acknowledge that the current experimental reporting lacks sufficient detail on controls and robustness. In the revised manuscript we will expand the Experiments section to specify the number of independent runs, include statistical significance tests (e.g., paired t-tests across benchmarks), provide a failure-case analysis, and report sensitivity results for inference steps and prompt re-weighting. These additions will allow readers to evaluate the stability of the reported gains (a significance-test sketch also follows below). revision: yes
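As a concrete sketch of the diversity probe promised in response 1: mean pairwise LPIPS over several refined outputs for the same prompt, where a larger value would suggest a larger effective modification space. The probe design and the use of the lpips package are our assumptions, not the authors' committed protocol:

```python
# Hedged sketch: mean pairwise LPIPS as a modification-space diversity proxy
# (pip install lpips). Inputs are image tensors in [-1, 1], shape [N, 3, H, W].
import itertools
import torch
import lpips

loss_fn = lpips.LPIPS(net='alex')  # learned perceptual distance

def mean_pairwise_lpips(images: torch.Tensor) -> float:
    pairs = itertools.combinations(range(images.size(0)), 2)
    dists = [loss_fn(images[i:i+1], images[j:j+1]).item() for i, j in pairs]
    return sum(dists) / len(dists)

samples = torch.rand(4, 3, 64, 64) * 2 - 1  # dummy refined outputs, one prompt
print(mean_pairwise_lpips(samples))
```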
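And for response 2, the promised paired t-test is cheap to set up once per-prompt alignment scores exist for both methods on the same prompt set; the score arrays below are placeholders, not reported data:

```python
# Hedged sketch: paired significance test over per-prompt alignment scores.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
scores_rve = rng.uniform(0.6, 0.9, size=200)            # placeholder RvE scores
scores_rvr = scores_rve + rng.normal(0.05, 0.05, 200)   # placeholder RvR scores

t_stat, p_value = ttest_rel(scores_rvr, scores_rve)     # paired across prompts
print(f"mean delta = {np.mean(scores_rvr - scores_rve):.3f}, "
      f"t = {t_stat:.2f}, p = {p_value:.2e}")
```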

Circularity Check

0 steps flagged

No circularity; empirical proposal without self-referential derivations

full rationale

The paper introduces RvR as a new framework reformulating refinement as conditional regeneration using target prompts and semantic tokens, then validates it solely via benchmark gains (Geneval 0.78→0.91, etc.). No equations, parameter fits, uniqueness theorems, or ansatzes are described that could reduce to inputs by construction. The derivation chain consists of a conceptual reformulation plus experimental demonstration; nothing is shown to be equivalent to prior fitted quantities or self-citations. This is self-contained against external benchmarks and matches the expected non-finding for method papers lacking mathematical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The claim depends on the domain assumption that semantic tokens capture sufficient content for regeneration and that enlarging the modification space is beneficial; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption Semantic tokens extracted from an initial image provide adequate conditioning information for high-quality regeneration aligned to a new prompt.
    Invoked when the method replaces pixel-level preservation with token-based conditioning.
invented entities (1)
  • Refinement via Regeneration (RvR) framework no independent evidence
    purpose: Reformulate refinement as conditional regeneration to enlarge modification space
    Newly proposed method; no independent evidence outside the paper's experiments is supplied.

pith-pipeline@v0.9.0 · 5553 in / 1236 out tokens · 46291 ms · 2026-05-07T16:51:19.679812+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 27 canonical work pages · 15 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv:2303.08774 (2023)

  2. [2]

    In: ICCV (2015)

    Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: Visual question answering. In: ICCV (2015)

  3. [3]

    Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)

  4. [4]

    OpenAI technical report (2023)

    Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al.: Improving image generation with better captions. OpenAI technical report (2023)

  5. [5]

HunyuanImage 3.0 Technical Report

    Cao, S., Chen, H., Chen, P., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025)

  6. [6]

BLIP3-o: A Family of Fully Open Unified Multimodal Models - Architecture, Training and Dataset

Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open unified multimodal models - architecture, training and dataset. URL https://arxiv.org/abs/2505.09568 (2025)

  7. [7]

Blip3o-next: Next Frontier of Native Image Generation

    Chen, J., Xue, L., Xu, Z., Pan, X., Yang, S., Qin, C., Yan, A., Zhou, H., Chen, Z., Huang, L., et al.: Blip3o-next: Next frontier of native image generation. arXiv preprint arXiv:2510.15857 (2025)

  8. [8]

    In: ICLR (2024)

    Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al.: Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In: ICLR (2024)

  9. [9]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025)

  10. [10]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  11. [11]

Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

    Dai, X., Hou, J., Ma, C.Y., Tsai, S., Wang, J., Wang, R., Zhang, P., Vandenhende, S., Wang, X., Dubey, A., et al.: Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv:2309.15807 (2023)

  12. [12]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

  13. [13]

    In: ICLR (2021)

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)

  14. [14]

    The Llama 3 Herd of Models

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The llama 3 herd of models. arXiv:2407.21783 (2024)

  15. [15]

    In: ICML (2024)

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024)

  16. [16]

Seed-X: Multimodal Models with Unified Multi-Granularity Comprehension and Generation

Ge, Y., Zhao, S., Zhu, J., Ge, Y., Yi, K., Song, L., Li, C., Ding, X., Shan, Y.: Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396 (2024)

  17. [17]

    In: NeurIPS (2023)

    Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. In: NeurIPS (2023)

  18. [18]

    In: NeurIPS (2014)

    Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NeurIPS (2014)

  19. [19]

    IEEE Transactions on Circuits and Systems for Video Technology (2023)

    Guo, J., Manukyan, H., Yang, C., Wang, C., Khachatryan, L., Navasardyan, S., Song, S., Shi, H., Huang, G.: Faceclip: Facial image-to-video translation via a brief text description. IEEE Transactions on Circuits and Systems for Video Technology (2023)

  20. [20]

    In: CVPR (2024)

    Guo, J., Xu, X., Pu, Y., Ni, Z., Wang, C., Vasu, M., Song, S., Huang, G., Shi, H.: Smooth diffusion: Crafting smooth latent spaces in diffusion models. In: CVPR (2024)

  21. [21]

    In: CVPR (2025)

    Guo, J., Yan, C., Xu, X., Wang, Y., Wang, K., Huang, G., Shi, H.: Img: Calibrating diffusion models via implicit multimodal guidance. In: CVPR (2025)

  22. [22]

    In: NeurIPS (2020)

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)

  23. [23]

    In: NeurIPS Workshops (2021)

    Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS Workshops (2021)

  24. [24]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., Yu, G.: Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv:2403.05135 (2024)

  25. [25]

Interleaving Reasoning for Better Text-to-Image Generation

    Huang, W., Chen, S., Xie, Z., Cao, S., Tang, S., Shen, Y., Yin, Q., Hu, W., Wang, X., Tang, Y., et al.: Interleaving reasoning for better text-to-image generation. arXiv preprint arXiv:2509.06945 (2025)

  26. [26]

    GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  27. [27]

    In: ICLR (2015)

    Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: ICLR (2015)

  28. [28]

Labs, B.F.: FLUX. https://blackforestlabs.ai/ (2024)

  29. [29]

    World model on million-length video and language with blockwise ringattention

    Liu, H., Yan, W., Zaharia, M., Abbeel, P.: World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268 (2024)

  30. [30]

    In: NeurIPS (2024)

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2024)

  31. [31]

    In: ICLR (2022)

    Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: ICLR (2022)

  32. [32]

    In: ICLR (2019)

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)

  33. [33]

    arXiv preprint arXiv:2509.18639 (2025)

    Lyu, Y., Wong, C.K., Liao, C., Jiang, L., Zheng, X., Lu, Z., Zhang, L., Hu, X.: Understanding-in-generation: Reinforcing generative capability of unified model via infusing understanding into generation. arXiv preprint arXiv:2509.18639 (2025)

  34. [34]

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

Ma, Y., Liu, X., Chen, X., Liu, W., Wu, C., Wu, Z., Pan, Z., Xie, Z., Zhang, H., Yu, X., Zhao, L., Wang, Y., Liu, J., Ruan, C.: Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. arXiv preprint arXiv:2411.07975 (2024)

  35. [35]

    In: ICML (2022)

Nichol, A.Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., Mcgrew, B., Sutskever, I., Chen, M.: GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In: ICML (2022)

  36. [36]

    Transfer between Modalities with MetaQueries

Pan, X., Shukla, S.N., Singh, A., Zhao, Z., Mishra, S.K., Wang, J., Xu, Z., Chen, J., Li, K., Juefei-Xu, F., Hou, J., Xie, S.: Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256 (2025)

  37. [37]

    In: ICCV (2023)

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV (2023)

  38. [38]

In: ICLR (2023)

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. In: ICLR (2023)

  39. [39]

Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision

    Qin, L., Gong, J., Sun, Y., Li, T., Yang, M., Yang, X., Qu, C., Tan, Z., Li, H.: Uni-cot: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606 (2025)

  40. [40]

    In: CVPR (2025)

    Qu, L., Zhang, H., Liu, Y., Wang, X., Jiang, Y., Gao, Y., Ye, H., Du, D.K., Yuan, Z., Wu, X.: Tokenflow: Unified image tokenizer for multimodal understanding and generation. In: CVPR (2025)

  41. [41]

    In: ICML (2021)

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)

  42. [42]

    JMLR (2020)

    Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR (2020)

  43. [44]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125 (2022)

  44. [45]

    In: CVPR (2022)

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

  45. [46]

    In: NeurIPS (2017)

    Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: NeurIPS (2017)

  46. [47]

Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024)

  47. [48]

    In: ICCV (2025)

    Wang, C., Lu, G., Yang, J., Huang, R., Han, J., Hou, L., Zhang, W., Xu, H.: Illume: Illuminating your llms to see, draw, and self-enhance. In: ICCV (2025)

  48. [49]

    In: ICML (2025)

    Wang, J., Pu, J., Qi, Z., Guo, J., Ma, Y., Huang, N., Chen, Y., Li, X., Shan, Y.: Taming rectified flow for inversion and editing. In: ICML (2025)

  49. [50]

    Promptenhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting

    Wang, L., Xing, X., Cheng, Y., Zhao, Z., Li, D., Hang, T., Tao, J., Wang, Q., Li, R., Chen, C., et al.: Promptenhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting. arXiv preprint arXiv:2509.04545 (2025)

  50. [51]

    Emu3: Next-Token Prediction is All You Need

    Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024)

  51. [52]

arXiv preprint arXiv:2510.18701 (2025)

Wang, Y., Li, Z., Zang, Y., Bu, J., Zhou, Y., Xin, Y., He, J., Wang, C., Lu, Q., Jin, C., et al.: Unigenbench++: A unified semantic evaluation benchmark for text-to-image generation. arXiv preprint arXiv:2510.18701 (2025)

  52. [53]

    Qwen-Image Technical Report

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)

  53. [54]

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    Wu, C., Chen, X., Wu, Z., Ma, Y., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C., et al.: Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848 (2024)

  54. [55]

    In: CVPR (2024)

    Wu, T.H., Lian, L., Gonzalez, J.E., Li, B., Darrell, T.: Self-correcting llm-controlled diffusion models. In: CVPR (2024)

  55. [56]

    In: ICLR (2025)

Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal understanding and generation. In: ICLR (2025)

  56. [57]

    Show-o2: Improved Native Unified Multimodal Models

Xie, J., Yang, Z., Shou, M.Z.: Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564 (2025)

  57. [58]

In: CVPR (2024)

Xu, X., Guo, J., Wang, Z., Huang, G., Essa, I., Shi, H.: Prompt-free diffusion: Taking "text" out of text-to-image diffusion models. In: CVPR (2024)

  58. [59]

    In: ICML (2024)

    Yang, L., Yu, Z., Meng, C., Xu, M., Ermon, S., Bin, C.: Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. In: ICML (2024)

  59. [60]

    In: CVPR (2025)

    Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: Anyedit: Mastering unified high-quality image editing for any idea. In: CVPR (2025)

  60. [61]

    In: NeurIPS (2023)

    Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. In: NeurIPS (2023)

  61. [62]

    In: NeurIPS (2024)

    Zhao, H., Ma, X.S., Chen, L., Si, S., Wu, R., An, K., Yu, P., Zhang, M., Li, Q., Chang, B.: Ultraedit: Instruction-based fine-grained image editing at scale. In: NeurIPS (2024)

  62. [63]

Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., Levy, O.: Transfusion: Predict the next token and diffuse images with one multi-modal model. In: ICLR (2025)