pith. sign in

arxiv: 2606.04264 · v1 · pith:2GO4UHIInew · submitted 2026-06-02 · 💻 cs.CV

UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation

Pith reviewed 2026-06-28 10:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion modelstext-in-image generationunified multimodal generationvision-language modelspixel canvasinterleaved content
0
0 comments X

The pith

UniCanvas generates text and images together by rendering language as visual patterns on a shared pixel canvas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UniCanvas as a diffusion-based approach to unified multimodal generation. It embeds language directly as visual patterns within images rather than using separate text tokens, allowing the model to draw both modalities on one pixel canvas. This design leverages diffusion models' strength in image synthesis while addressing their weakness in producing coherent text. A sympathetic reader would see this as a way to create a single architecture that handles interleaved text and visual content without switching between autoregressive and diffusion mechanisms.

Core claim

UniCanvas unifies diffusion models to generate interleaved multimodal contents through text-in-image generation, where the model learns to represent language as visual patterns inside images on a shared pixel canvas instead of producing discrete text tokens.

What carries the argument

The shared pixel canvas on which the diffusion model generates both images and text rendered as visual patterns.

If this is right

  • The model produces coherent text embedded within images as part of a single synthesis process.
  • Diffusion models become viable for unified multimodal generation without autoregressive components.
  • Text-in-image generation establishes a new paradigm that improves performance over prior unified vision-language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same canvas-based approach could be tested on generating structured visual elements such as charts or diagrams alongside natural images.
  • If successful at scale, this method might reduce the need for hybrid architectures that combine separate text and image generators.
  • Extensions to longer interleaved sequences could test whether the pixel-canvas representation maintains coherence across multiple sentences of embedded text.

Load-bearing premise

Representing language as visual patterns inside images on a shared pixel canvas enables the diffusion model to generate coherent text without separate discrete token mechanisms.

What would settle it

A side-by-side evaluation in which text rendered inside generated images is measurably less readable or coherent than text produced by models that use explicit token prediction would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.04264 by Chuang Gan, Hao-Wei Chen, Haoyu Zhen, Maohao Shen, Xueyang Yu, Yuncong Yang, Zeyuan Yang, Ziqiao Ma.

Figure 1
Figure 1. Figure 1: Sequential multimodal reasoning across diverse tasks. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overall Architecture of UniCanvas. Our model encodes multimodal inputs and input noise through a frozen tokenizer and VAE encoder, then processes them with stacked DiT blocks. A frozen VAE decoder reconstructs the predicted image, and CLIP￾based losses enforce alignment with the ground-truth textual condition. pathways. Whether these two capabilities can truly co-exist and mutually en￾hance each other rema… view at source ↗
Figure 3
Figure 3. Figure 3: Two-stage Canvas Update in UniCanvas. In the first stage, the model “writes” the semantic texts on the canvas, while in the second stage, the modal updates the scene image conditioned on the enriched canvas. Iterative interleave generation al￾ternates between these two stages to produce coherent step-wise multimodal sequences. 3.1 Text-in-Image Generation In text-in-image generation, UniCanvas learns to “w… view at source ↗
Figure 4
Figure 4. Figure 4: Comparison with Nano-Banana on VSP. Qualitative comparison of action-sequence generation. Our two-stage approach produces more consistent navi￾gation trajectories, while Nano-Banana shows unstable plans and generation in both its one-stage and two-stage variants. 4.2 Experimental Results Long-Horizon cross-Modal planning. The VSP planning task serves as a challenging cross-modal reasoning benchmark that re… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison with Nano-Banana on RLBench. Qualitative comparison of action￾sequence generation. Our model yields clearer action intent and more accurate scene transitions, while both Nano-Banana variants often produce ambigu￾ous or inconsistent results. Qualitative comparisons on visual quality. To further assess the qualitative behav￾ior of our model, we com￾pare UniCanvas with Nano￾Banana. In [PITH_FULL_I… view at source ↗
Figure 6
Figure 6. Figure 6: Failure case of long-horizon rollouts. UniCanvas becomes unreliable at predicting the correct next-state image when evaluated in the long sequence scenarios. Once the predicted state deviates, subsequent rollouts follow an incorrect trajectory. conditioning for both the reasoning prompt (creason), and the execution prompt (cexec) is handled through the tokenizer used in Qwen2.5-VL, which ensures ro￾bust al… view at source ↗
Figure 7
Figure 7. Figure 7: Failure case of text-in-image generation. [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Self-correction during long-horizon rollouts. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Quantitative results on general visual reasoning. [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Representative failure patterns of text generation. [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
read the original abstract

Recent years have seen remarkable progress in unified vision-language models handling both multimodal understanding and generation within a single architecture. While autoregressive VLMs can reason across modalities, they fail to generate high-quality images. In contrast, diffusion models produce photorealistic visuals yet struggle to generate coherent text, making it challenging to develop a single unified model that can seamlessly handle both visual and text generation. Recent advances suggest that language can be effectively embedded within visual representations, allowing models to reason about textual semantics directly from images. To this end, we propose UniCanvas, a first attempt that unifies diffusion models to generate interleaved multimodal contents through text-in-image generation. Diffusion models naturally capture transformations on a shared pixel canvas, which can be viewed as world models of visual change. Instead of producing discrete text tokens, the model learns to represent language as visual patterns inside images, leveraging its inherent multimodal embedding space. This design allows the model to "draw" text naturally within a single pixel canvas during image synthesis, achieving seamless multimodal generation. Experiments demonstrate that UniCanvas improves performance over previous unified models, positioning text-in-image generation with diffusion models as a promising unified multimodal generation paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes UniCanvas, a diffusion-based model for unified multimodal generation that embeds language as visual patterns on a shared pixel canvas instead of using discrete text tokens. This design is intended to allow a single diffusion model to generate interleaved text and images by treating text generation as a visual synthesis task. The abstract asserts that experiments show performance improvements over prior unified models and positions the approach as a promising paradigm.

Significance. If the empirical claims are substantiated, the work could meaningfully advance unified vision-language generation by removing the need for separate autoregressive text mechanisms and leveraging diffusion models' existing pixel-canvas operations, potentially simplifying architectures for joint multimodal output.

major comments (1)
  1. [Abstract] Abstract: the central claim that 'Experiments demonstrate that UniCanvas improves performance over previous unified models' is unsupported by any metrics, baselines, tables, figures, or experimental details in the manuscript, making the primary empirical assertion impossible to evaluate.
minor comments (1)
  1. [Title] Title contains a clear typo: 'Diffusion-base' should read 'Diffusion-based'.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and for highlighting this issue with the abstract. We address the comment point-by-point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'Experiments demonstrate that UniCanvas improves performance over previous unified models' is unsupported by any metrics, baselines, tables, figures, or experimental details in the manuscript, making the primary empirical assertion impossible to evaluate.

    Authors: We agree that the current version of the manuscript does not contain the metrics, baselines, tables, or figures needed to substantiate the performance claim made in the abstract. In the revised manuscript we will remove the unsupported empirical assertion from the abstract (or qualify it as a direction for future work) until the corresponding experimental results can be included. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces UniCanvas as a conceptual architecture for unified multimodal generation by embedding text as visual patterns on a shared pixel canvas within a diffusion model. No equations, parameter fittings, predictions derived from inputs, or self-citation chains appear in the provided abstract or described structure. The central claim rests on the design choice and subsequent empirical experiments, which are presented as falsifiable performance improvements rather than any self-referential derivation or renaming of known results. The derivation chain is self-contained at the level of a high-level proposal without reductions to fitted inputs or prior author work by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the proposal relies on high-level conceptual assumptions about visual embedding of language.

pith-pipeline@v0.9.1-grok · 5758 in / 975 out tokens · 22002 ms · 2026-06-28T10:21:48.519086+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

69 extracted references · 52 canonical work pages · 24 internal anchors

  1. [2]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  2. [3]

    In: Proceedings of the Computer Vision and Pattern Recognition Con- ference

    Bar, A., Zhou, G., Tran, D., Darrell, T., LeCun, Y.: Navigation world mod- els. In: Proceedings of the Computer Vision and Pattern Recognition Con- ference. pp. 15791–15801 (2025)

  3. [4]

    arXiv preprint arXiv:2602.02227 (2026)

    Chen, H.H., Yin, X., Shu, W.J., Zhang, H., Zhang, Z., Liao, C., Guo, L., Chen, Q., Chen, Y.C.: Show, don’t tell: Morphing latent reasoning into image generation. arXiv preprint arXiv:2602.02227 (2026)

  4. [5]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open uni- fied multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025)

  5. [6]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al.: Pixart: Fast training of diffusion transformer for photore- alistic text-to-image synthesis. arXiv preprint arXiv:2310.00426 (2023)

  6. [7]

    arXiv preprint arXiv:2407.06135 (2024)

    Chern, E., Su, J., Ma, Y., Liu, P.: Anole: An open, autoregressive, na- tive large multimodal models for interleaved image-text generation. arXiv preprint arXiv:2407.06135 (2024)

  7. [8]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

  8. [9]

    Advances in neural information processing systems34, 8780–8794 (2021)

    Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems34, 8780–8794 (2021)

  9. [10]

    arXiv preprint arXiv:2309.11499 (2023)

    Dong, R., Han, C., Peng, Y., Qi, Z., Ge, Z., Yang, J., Zhao, L., Sun, J., Zhou, H., Wei, H., et al.: Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499 (2023)

  10. [11]

    GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

    Duan, C., Fang, R., Wang, Y., Wang, K., Huang, L., Zeng, X., Li, H., Liu, X.: Got-r1: Unleashing reasoning capability of mllm for visual generation with reinforcement learning. arXiv preprint arXiv:2505.17022 (2025)

  11. [12]

    In: Forty-first international confer- ence on machine learning (2024)

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transform- ers for high-resolution image synthesis. In: Forty-first international confer- ence on machine learning (2024)

  12. [13]

    arXiv preprint arXiv:2405.05945 (2024) UniCanvas: Diffusion-base Unified Model for Text-in-Image Joint Generation 25

    Gao, P., Zhuo, L., Liu, D., Du, R., Luo, X., Qiu, L., Zhang, Y., Lin, C., Huang, R., Geng, S., et al.: Lumina-t2x: Transforming text into any modal- ity, resolution, and duration via flow-based large diffusion transformers. arXiv preprint arXiv:2405.05945 (2024) UniCanvas: Diffusion-base Unified Model for Text-in-Image Joint Generation 25

  13. [14]

    arXiv preprint arXiv:2503.18938 (2025)

    Gao, S., Zhou, S., Du, Y., Zhang, J., Gan, C.: Adaworld: Learning adaptable world models with latent actions. arXiv preprint arXiv:2503.18938 (2025)

  14. [15]

    ThinkMorph: Emergent properties in multimodal interleaved chain-of-thought reasoning.arXiv preprint arXiv:2510.27492, 2025

    Gu, J., Hao, Y., Wang, H.W., Li, L., Shieh, M.Q., Choi, Y., Krishna, R., Cheng, Y.: Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. arXiv preprint arXiv:2510.27492 (2025)

  15. [16]

    arXiv preprint arXiv:2511.16671 (2025)

    Guo, Z., Zhang, R., Li, H., Zhang, M., Chen, X., Wang, S., Feng, Y., Pei, P., Heng, P.A.: Thinking-while-generating: Interleaving textual reasoning throughout visual generation. arXiv preprint arXiv:2511.16671 (2025)

  16. [17]

    arXiv preprint arXiv:2501.13926 (2025)

    Guo, Z., Zhang, R., Tong, C., Zhao, Z., Huang, R., Zhang, H., Zhang, M., Liu, J., Zhang, S., Gao, P., et al.: Can we generate images with cot? let’s verify and reinforce image generation step by step. arXiv preprint arXiv:2501.13926 (2025)

  17. [18]

    World Models

    Ha, D., Schmidhuber, J.: World models. arXiv preprint arXiv:1803.10122 2(3), 440 (2018)

  18. [19]

    Dream to Control: Learning Behaviors by Latent Imagination

    Hafner, D., Lillicrap, T., Ba, J., Norouzi, M.: Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603 (2019)

  19. [20]

    arXiv preprint arXiv:2601.03193 (2026)

    Han, R., Fang, Z., Sun, X., Ma, Y., Wang, Z., Zeng, Y., Chen, Z., Chen, L., Huang, W., Xu, W.J., et al.: Unicorn: Towards self-improving uni- fied multimodal models through self-generated supervision. arXiv preprint arXiv:2601.03193 (2026)

  20. [21]

    In: Advances in Neural Information Processing Systems (2020)

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems (2020)

  21. [22]

    Classifier-Free Diffusion Guidance

    Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

  22. [23]

    IEEE Robotics and Automation Letters5(2), 3019–3026 (2020)

    James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: Rlbench: The robot learn- ing benchmark & learning environment. IEEE Robotics and Automation Letters5(2), 3019–3026 (2020)

  23. [24]

    arXiv preprint arXiv:2505.00703 (2025)

    Jiang, D., Guo, Z., Zhang, R., Zong, Z., Li, H., Zhuo, L., Yan, S., Heng, P.A., Li, H.: T2i-r1: Reinforcing image generation with collaborative semantic- level and token-level cot. arXiv preprint arXiv:2505.00703 (2025)

  24. [25]

    arXiv preprint arXiv:2512.05112 (2025)

    Jiang, D., Zhang, R., Li, H., Zong, Z., Guo, Z., He, J., Guo, C., Ye, J., Fang, R., Li, W., et al.: Draco: Draft as cot for text-to-image preview and rare concept generation. arXiv preprint arXiv:2512.05112 (2025)

  25. [26]

    arXiv preprint arXiv:2505.17534 (2025)

    Jiang, J., Si, C., Luo, J., Zhang, H., Ma, C.: Co-reinforcement learn- ing for unified multimodal understanding and generation. arXiv preprint arXiv:2505.17534 (2025)

  26. [27]

    Advances in neural information process- ing systems35, 26565–26577 (2022)

    Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. Advances in neural information process- ing systems35, 26565–26577 (2022)

  27. [28]

    Advances in neural information processing systems34, 21696–21707 (2021)

    Kingma, D., Salimans, T., Poole, B., Ho, J.: Variational diffusion models. Advances in neural information processing systems34, 21696–21707 (2021)

  28. [29]

    Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

    Li, C., Wu, W., Zhang, H., Xia, Y., Mao, S., Dong, L., Vulić, I., Wei, F.: Imagine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542 (2025)

  29. [30]

    Causal World Modeling for Robot Control

    Li, L., Zhang, Q., Luo, Y., Yang, S., Wang, R., Han, F., Yu, M., Gao, Z., Xue, N., Zhu, X., Shen, Y., Xu, Y.: Causal world modeling for robot control. arXiv preprint arXiv:2601.21998 (2026) 26 Z. Yang et al

  30. [31]

    Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

    Liao, Y., Zhou, P., Huang, S., Yang, D., Chen, S., Jiang, Y., Hu, Y., Cai, J., Liu, S., Luo, J., et al.: Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635 (2025)

  31. [32]

    In: NeurIPS (2023)

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)

  32. [33]

    arXiv preprint arXiv:2502.20321 (2025)

    Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Unitok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321 (2025)

  33. [34]

    arXiv preprint arXiv:2411.07975 (2024)

    Ma, Y., Liu, X., Chen, X., Liu, W., Wu, C., Wu, Z., Pan, Z., Xie, Z., Zhang, H., Zhao, L., et al.: Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. arXiv preprint arXiv:2411.07975 (2024)

  34. [35]

    In: International conference on machine learning

    Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic mod- els. In: International conference on machine learning. pp. 8162–8171. PMLR (2021)

  35. [36]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

  36. [37]

    arXiv preprint arXiv:2510.07313 (2025)

    Qian, Z., Chi, X., Li, Y., Wang, S., Qin, Z., Ju, X., Han, S., Zhang, S.: Wristworld: Generating wrist-views via 4d world models for robotic manip- ulation. arXiv preprint arXiv:2510.07313 (2025)

  37. [38]

    arXiv preprint arXiv:2508.05606 (2025)

    Qin, L., Gong, J., Sun, Y., Li, T., Yang, M., Yang, X., Qu, C., Tan, Z., Li, H.: Uni-cot: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606 (2025)

  38. [39]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchi- cal text-conditional image generation with clip latents. arXiv preprint arXiv:2204.061251(2), 3 (2022)

  39. [40]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High- resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

  40. [41]

    Advances in neural information processing systems35, 36479– 36494 (2022)

    Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language under- standing. Advances in neural information processing systems35, 36479– 36494 (2022)

  41. [42]

    Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

    Shi, Q., Bai, J., Zhao, Z., Chai, W., Yu, K., Wu, J., Song, S., Tong, Y., Li, X., Li, X., et al.: Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model. arXiv:2505.23606 (2025)

  42. [43]

    In: Proceedings of the 32nd International Conference on Machine Learning

    Sohl-Dickstein,J.,Weiss,E.A.,Maheswaranathan,N.,Ganguli,S.:Deepun- supervised learning using nonequilibrium thermodynamics. In: Proceedings of the 32nd International Conference on Machine Learning. pp. 2256–2265 (2015)

  43. [44]

    Advances in neural information processing systems34, 1415–1428 (2021) UniCanvas: Diffusion-base Unified Model for Text-in-Image Joint Generation 27

    Song, Y., Durkan, C., Murray, I., Ermon, S.: Maximum likelihood training of score-based diffusion models. Advances in neural information processing systems34, 1415–1428 (2021) UniCanvas: Diffusion-base Unified Model for Text-in-Image Joint Generation 27

  44. [45]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Song,Y.,Sohl-Dickstein,J.,Kingma,D.P.,Kumar,A.,Ermon,S.,Poole,B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)

  45. [46]

    Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

    Su, Z., Wei, H., Cen, K., Wang, Y., Chen, G., Yuan, C., Chu, X.: Gen- eration enhances understanding in unified multimodal models via multi- representation generation. arXiv preprint arXiv:2601.21406 (2026)

  46. [47]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Team, C.: Chameleon: Mixed-modal early-fusion foundation models, 2024. URL https://arxiv. org/abs/2405.098189(8) (2024)

  47. [48]

    Next-Latent Prediction Transformers Learn Compact World Models

    Teoh, J., Tomar, M., Ahn, K., Hu, E.S., Sharma, P., Islam, R., Lamb, A., Langford, J.: Next-latent prediction transformers learn compact world models. arXiv preprint arXiv:2511.05963 (2025)

  48. [49]

    arXiv preprint arXiv:2401.10208 (2024)

    Tian, C., Zhu, X., Xiong, Y., Wang, W., Chen, Z., Wang, W., Chen, Y., Lu, L., Lu, T., Zhou, J., et al.: Mm-interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer. arXiv preprint arXiv:2401.10208 (2024)

  49. [50]

    MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

    Tong, S., Fan, D., Zhu, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164 (2024)

  50. [51]

    arXiv preprint arXiv:2602.01382 (2026)

    Wang,F.Y.,Zhang,H.,Gharbi,M.,Li,H.,Park,T.:Promptrl:Promptmat- ters in rl for flow-based image generation. arXiv preprint arXiv:2602.01382 (2026)

  51. [52]

    arXiv:2505.20147 (2025)

    Wang, J., Lai, Y., Li, A., Zhang, S., Sun, J., Kang, N., Wu, C., Li, Z., Luo, P.: Fudoki: Discrete flow-based unified understanding and generation via kinetic-optimal velocities. arXiv:2505.20147 (2025)

  52. [53]

    arXiv preprint arXiv:2411.07199 (2024)

    Wei, C., Xiong, Z., Ren, W., Du, X., Zhang, G., Chen, W.: Omniedit: Build- ing image editing generalist models through specialist supervision. arXiv preprint arXiv:2411.07199 (2024)

  53. [54]

    arXiv preprint arXiv:2601.19834 (2026)

    Wu, J., Zhang, X., Yuan, H., Zhang, X., Huang, T., He, C., Deng, C., Zhang, R., Wu, Y., Long, M.: Visual generation unlocks human-like rea- soning through multimodal world models. arXiv preprint arXiv:2601.19834 (2026)

  54. [55]

    arXiv preprint arXiv:2407.01863 (2024)

    Wu, Q., Zhao, H., Saxon, M., Bui, T., Wang, W.Y., Zhang, Y., Chang, S.: Vsp: Assessing the dual challenges of perception and reasoning in spatial planning tasks for vlms. arXiv preprint arXiv:2407.01863 (2024)

  55. [56]

    VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    Wu, Y., Zhang, Z., Chen, J., Tang, H., Li, D., Fang, Y., Zhu, L., Xie, E., Yin, H., Yi, L., et al.: Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429 (2024)

  56. [57]

    arXiv preprint arXiv:2505.13031 (2025)

    Xiao, Y., Song, L., Chen, Y., Luo, Y., Chen, Y., Gan, Y., Huang, W., Li, X., Qi, X., Shan, Y.: Mindomni: Unleashing reasoning generation in vision language models with rgpo. arXiv preprint arXiv:2505.13031 (2025)

  57. [58]

    In: The Thirteenth International Conference on Learning Representations (2025)

    Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify mul- timodal understanding and generation. In: The Thirteenth International Conference on Learning Representations (2025)

  58. [59]

    arXiv preprint arXiv:2505.11409 (2025) 28 Z

    Xu, Y., Li, C., Zhou, H., Wan, X., Zhang, C., Korhonen, A., Vulić, I.: Visual planning: Let’s think only with images. arXiv preprint arXiv:2505.11409 (2025) 28 Z. Yang et al

  59. [60]

    MMaDA: Multimodal Large Diffusion Language Models

    Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., Wang, M.: Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809 (2025)

  60. [61]

    arXiv preprint arXiv:2507.12508 (2025)

    Yang, Y., Liu, J., Zhang, Z., Zhou, S., Tan, R., Yang, J., Du, Y., Gan, C.: Mindjourney: Test-time scaling with world models for spatial reasoning. arXiv preprint arXiv:2507.12508 (2025)

  61. [62]

    Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

    Yang, Z., Yu, X., Chen, D., Shen, M., Gan, C.: Machine mental imagery: Empower multimodal reasoning with latent visual tokens. arXiv preprint arXiv:2506.17218 (2025)

  62. [63]

    Ye, S., Ge, Y., Zheng, K., Gao, S., Yu, S., Kurian, G., Indupuru, S., Tan, Y.L., Zhu, C., Xiang, J., Malik, A., Lee, K., Liang, W., Ranawaka, N., Gu, J., Xu, Y., Wang, G., Hu, F., Narayan, A., Bjorck, J., Wang, J., Kim, G., Niu, D., Zheng, R., Xie, Y., Wu, J., Wang, Q., Julian, R., Xu, D., Du, Y., Chebotar, Y., Reed, S., Kautz, J., Zhu, Y., Fan, L.J., Jan...

  63. [64]

    arXiv preprint arXiv:2511.22625 (2025)

    Yin, F., Liu, S., Han, Y., Wang, Z., Xing, P., Wang, R., Cheng, W., Wang, Y., Li, A., Yin, Z., et al.: Reasonedit: Towards reasoning-enhanced image editing models. arXiv preprint arXiv:2511.22625 (2025)

  64. [65]

    In: Proceedings of the 33rd ACM International Conference on Mul- timedia

    Zeng, B., Yang, L., Liu, J., Xu, M., Zhang, Y., Wan, P., Zhang, W., Yan, S.: Editworld: Simulating world dynamics for instruction-following image editing. In: Proceedings of the 33rd ACM International Conference on Mul- timedia. pp. 12674–12681 (2025)

  65. [66]

    In: CVPR (2018)

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreason- able effectiveness of deep features as a perceptual metric. In: CVPR (2018)

  66. [67]

    arXiv preprint arXiv:2602.12322 (2026)

    Zhang, Z., Yang, S., Hu, Q., Huang, L.J., Hou, J., Sun, Y., Lu, Y., Han, S.: Foreact: Steering your vla with efficient visual foresight planning. arXiv preprint arXiv:2602.12322 (2026)

  67. [68]

    3D-VLA: A 3D Vision-Language-Action Generative World Model

    Zhen, H., Qiu, X., Chen, P., Yang, J., Yan, X., Du, Y., Hong, Y., Gan, C.: 3d-vla: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631 (2024)

  68. [69]

    arXiv preprint arXiv:2504.20995 (2025)

    Zhen, H., Sun, Q., Zhang, H., Li, J., Zhou, S., Du, Y., Gan, C.: Tesser- act: Learning 4d embodied world models. arXiv preprint arXiv:2504.20995 (2025)

  69. [70]

    In: The Thirteenth Inter- national Conference on Learning Representations (2025)

    Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., Levy, O.: Transfusion: Predict the next token and diffuse images with one multi-modal model. In: The Thirteenth Inter- national Conference on Learning Representations (2025)