UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation

Chuang Gan; Hao-Wei Chen; Haoyu Zhen; Maohao Shen; Xueyang Yu; Yuncong Yang; Zeyuan Yang; Ziqiao Ma

arxiv: 2606.04264 · v1 · pith:2GO4UHIInew · submitted 2026-06-02 · 💻 cs.CV

UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation

Zeyuan Yang , Hao-Wei Chen , Xueyang Yu , Yuncong Yang , Haoyu Zhen , Ziqiao Ma , Maohao Shen , Chuang Gan This is my paper

Pith reviewed 2026-06-28 10:21 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion modelstext-in-image generationunified multimodal generationvision-language modelspixel canvasinterleaved content

0 comments

The pith

UniCanvas generates text and images together by rendering language as visual patterns on a shared pixel canvas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UniCanvas as a diffusion-based approach to unified multimodal generation. It embeds language directly as visual patterns within images rather than using separate text tokens, allowing the model to draw both modalities on one pixel canvas. This design leverages diffusion models' strength in image synthesis while addressing their weakness in producing coherent text. A sympathetic reader would see this as a way to create a single architecture that handles interleaved text and visual content without switching between autoregressive and diffusion mechanisms.

Core claim

UniCanvas unifies diffusion models to generate interleaved multimodal contents through text-in-image generation, where the model learns to represent language as visual patterns inside images on a shared pixel canvas instead of producing discrete text tokens.

What carries the argument

The shared pixel canvas on which the diffusion model generates both images and text rendered as visual patterns.

If this is right

The model produces coherent text embedded within images as part of a single synthesis process.
Diffusion models become viable for unified multimodal generation without autoregressive components.
Text-in-image generation establishes a new paradigm that improves performance over prior unified vision-language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same canvas-based approach could be tested on generating structured visual elements such as charts or diagrams alongside natural images.
If successful at scale, this method might reduce the need for hybrid architectures that combine separate text and image generators.
Extensions to longer interleaved sequences could test whether the pixel-canvas representation maintains coherence across multiple sentences of embedded text.

Load-bearing premise

Representing language as visual patterns inside images on a shared pixel canvas enables the diffusion model to generate coherent text without separate discrete token mechanisms.

What would settle it

A side-by-side evaluation in which text rendered inside generated images is measurably less readable or coherent than text produced by models that use explicit token prediction would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.04264 by Chuang Gan, Hao-Wei Chen, Haoyu Zhen, Maohao Shen, Xueyang Yu, Yuncong Yang, Zeyuan Yang, Ziqiao Ma.

**Figure 2.** Figure 2: Overall Architecture of UniCanvas. Our model encodes multimodal inputs and input noise through a frozen tokenizer and VAE encoder, then processes them with stacked DiT blocks. A frozen VAE decoder reconstructs the predicted image, and CLIPbased losses enforce alignment with the ground-truth textual condition. pathways. Whether these two capabilities can truly co-exist and mutually enhance each other rema… view at source ↗

**Figure 3.** Figure 3: Two-stage Canvas Update in UniCanvas. In the first stage, the model “writes” the semantic texts on the canvas, while in the second stage, the modal updates the scene image conditioned on the enriched canvas. Iterative interleave generation alternates between these two stages to produce coherent step-wise multimodal sequences. 3.1 Text-in-Image Generation In text-in-image generation, UniCanvas learns to “w… view at source ↗

**Figure 4.** Figure 4: Comparison with Nano-Banana on VSP. Qualitative comparison of action-sequence generation. Our two-stage approach produces more consistent navigation trajectories, while Nano-Banana shows unstable plans and generation in both its one-stage and two-stage variants. 4.2 Experimental Results Long-Horizon cross-Modal planning. The VSP planning task serves as a challenging cross-modal reasoning benchmark that re… view at source ↗

**Figure 5.** Figure 5: Comparison with Nano-Banana on RLBench. Qualitative comparison of actionsequence generation. Our model yields clearer action intent and more accurate scene transitions, while both Nano-Banana variants often produce ambiguous or inconsistent results. Qualitative comparisons on visual quality. To further assess the qualitative behavior of our model, we compare UniCanvas with NanoBanana. In [PITH_FULL_I… view at source ↗

**Figure 6.** Figure 6: Failure case of long-horizon rollouts. UniCanvas becomes unreliable at predicting the correct next-state image when evaluated in the long sequence scenarios. Once the predicted state deviates, subsequent rollouts follow an incorrect trajectory. conditioning for both the reasoning prompt (creason), and the execution prompt (cexec) is handled through the tokenizer used in Qwen2.5-VL, which ensures robust al… view at source ↗

**Figure 7.** Figure 7: Failure case of text-in-image generation. [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

**Figure 8.** Figure 8: Self-correction during long-horizon rollouts. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: Quantitative results on general visual reasoning. [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗

**Figure 10.** Figure 10: Representative failure patterns of text generation. [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗

read the original abstract

Recent years have seen remarkable progress in unified vision-language models handling both multimodal understanding and generation within a single architecture. While autoregressive VLMs can reason across modalities, they fail to generate high-quality images. In contrast, diffusion models produce photorealistic visuals yet struggle to generate coherent text, making it challenging to develop a single unified model that can seamlessly handle both visual and text generation. Recent advances suggest that language can be effectively embedded within visual representations, allowing models to reason about textual semantics directly from images. To this end, we propose UniCanvas, a first attempt that unifies diffusion models to generate interleaved multimodal contents through text-in-image generation. Diffusion models naturally capture transformations on a shared pixel canvas, which can be viewed as world models of visual change. Instead of producing discrete text tokens, the model learns to represent language as visual patterns inside images, leveraging its inherent multimodal embedding space. This design allows the model to "draw" text naturally within a single pixel canvas during image synthesis, achieving seamless multimodal generation. Experiments demonstrate that UniCanvas improves performance over previous unified models, positioning text-in-image generation with diffusion models as a promising unified multimodal generation paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

UniCanvas reframes unified multimodal generation by embedding text as visual patterns on the diffusion canvas, but the performance claims rest on details not visible in the abstract.

read the letter

UniCanvas's core idea is to unify text and image generation in a diffusion model by representing language as visual patterns on the image canvas instead of using separate text tokens. This lets the model 'draw' text during synthesis on the same pixel space.

The paper does well in explaining the motivation. Diffusion models already handle visual changes on a canvas, so extending that to text makes sense for avoiding the weaknesses of autoregressive models on images. The approach avoids adding discrete mechanisms and keeps everything in the multimodal embedding space.

What stands out as new is the framing of text-in-image generation as the way to achieve unified multimodal output with diffusion. It positions this as a promising paradigm.

The soft spots are around the evidence. The abstract claims better performance over previous unified models, but without metrics, baselines, or details on how text coherence was measured, it's hard to judge if the method actually delivers. The full paper may have tables and comparisons, but if they are not strong, the central claim weakens.

This paper is for people working on diffusion-based generation and multimodal unification. A reader looking for new ways to handle text in visual models could get value from the concept, though they'd need to check the experiments closely.

I think it deserves a serious referee because the idea is worth testing in the community, even if revisions are needed for the evaluation section.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes UniCanvas, a diffusion-based model for unified multimodal generation that embeds language as visual patterns on a shared pixel canvas instead of using discrete text tokens. This design is intended to allow a single diffusion model to generate interleaved text and images by treating text generation as a visual synthesis task. The abstract asserts that experiments show performance improvements over prior unified models and positions the approach as a promising paradigm.

Significance. If the empirical claims are substantiated, the work could meaningfully advance unified vision-language generation by removing the need for separate autoregressive text mechanisms and leveraging diffusion models' existing pixel-canvas operations, potentially simplifying architectures for joint multimodal output.

major comments (1)

[Abstract] Abstract: the central claim that 'Experiments demonstrate that UniCanvas improves performance over previous unified models' is unsupported by any metrics, baselines, tables, figures, or experimental details in the manuscript, making the primary empirical assertion impossible to evaluate.

minor comments (1)

[Title] Title contains a clear typo: 'Diffusion-base' should read 'Diffusion-based'.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and for highlighting this issue with the abstract. We address the comment point-by-point below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that 'Experiments demonstrate that UniCanvas improves performance over previous unified models' is unsupported by any metrics, baselines, tables, figures, or experimental details in the manuscript, making the primary empirical assertion impossible to evaluate.

Authors: We agree that the current version of the manuscript does not contain the metrics, baselines, tables, or figures needed to substantiate the performance claim made in the abstract. In the revised manuscript we will remove the unsupported empirical assertion from the abstract (or qualify it as a direction for future work) until the corresponding experimental results can be included. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces UniCanvas as a conceptual architecture for unified multimodal generation by embedding text as visual patterns on a shared pixel canvas within a diffusion model. No equations, parameter fittings, predictions derived from inputs, or self-citation chains appear in the provided abstract or described structure. The central claim rests on the design choice and subsequent empirical experiments, which are presented as falsifiable performance improvements rather than any self-referential derivation or renaming of known results. The derivation chain is self-contained at the level of a high-level proposal without reductions to fitted inputs or prior author work by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the proposal relies on high-level conceptual assumptions about visual embedding of language.

pith-pipeline@v0.9.1-grok · 5758 in / 975 out tokens · 22002 ms · 2026-06-28T10:21:48.519086+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

69 extracted references · 52 canonical work pages · 24 internal anchors

[2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

In: Proceedings of the Computer Vision and Pattern Recognition Con- ference

Bar, A., Zhou, G., Tran, D., Darrell, T., LeCun, Y.: Navigation world mod- els. In: Proceedings of the Computer Vision and Pattern Recognition Con- ference. pp. 15791–15801 (2025)

2025
[4]

arXiv preprint arXiv:2602.02227 (2026)

Chen, H.H., Yin, X., Shu, W.J., Zhang, H., Zhang, Z., Liao, C., Guo, L., Chen, Q., Chen, Y.C.: Show, don’t tell: Morphing latent reasoning into image generation. arXiv preprint arXiv:2602.02227 (2026)

work page arXiv 2026
[5]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open uni- fied multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al.: Pixart: Fast training of diffusion transformer for photore- alistic text-to-image synthesis. arXiv preprint arXiv:2310.00426 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

arXiv preprint arXiv:2407.06135 (2024)

Chern, E., Su, J., Ma, Y., Liu, P.: Anole: An open, autoregressive, na- tive large multimodal models for interleaved image-text generation. arXiv preprint arXiv:2407.06135 (2024)

work page arXiv 2024
[8]

Emerging Properties in Unified Multimodal Pretraining

Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Advances in neural information processing systems34, 8780–8794 (2021)

Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems34, 8780–8794 (2021)

2021
[10]

arXiv preprint arXiv:2309.11499 (2023)

Dong, R., Han, C., Peng, Y., Qi, Z., Ge, Z., Yang, J., Zhao, L., Sun, J., Zhou, H., Wei, H., et al.: Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499 (2023)

work page arXiv 2023
[11]

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Duan, C., Fang, R., Wang, Y., Wang, K., Huang, L., Zeng, X., Li, H., Liu, X.: Got-r1: Unleashing reasoning capability of mllm for visual generation with reinforcement learning. arXiv preprint arXiv:2505.17022 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

In: Forty-first international confer- ence on machine learning (2024)

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transform- ers for high-resolution image synthesis. In: Forty-first international confer- ence on machine learning (2024)

2024
[13]

arXiv preprint arXiv:2405.05945 (2024) UniCanvas: Diffusion-base Unified Model for Text-in-Image Joint Generation 25

Gao, P., Zhuo, L., Liu, D., Du, R., Luo, X., Qiu, L., Zhang, Y., Lin, C., Huang, R., Geng, S., et al.: Lumina-t2x: Transforming text into any modal- ity, resolution, and duration via flow-based large diffusion transformers. arXiv preprint arXiv:2405.05945 (2024) UniCanvas: Diffusion-base Unified Model for Text-in-Image Joint Generation 25

work page arXiv 2024
[14]

arXiv preprint arXiv:2503.18938 (2025)

Gao, S., Zhou, S., Du, Y., Zhang, J., Gan, C.: Adaworld: Learning adaptable world models with latent actions. arXiv preprint arXiv:2503.18938 (2025)

work page arXiv 2025
[15]

ThinkMorph: Emergent properties in multimodal interleaved chain-of-thought reasoning.arXiv preprint arXiv:2510.27492, 2025

Gu, J., Hao, Y., Wang, H.W., Li, L., Shieh, M.Q., Choi, Y., Krishna, R., Cheng, Y.: Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. arXiv preprint arXiv:2510.27492 (2025)

work page arXiv 2025
[16]

arXiv preprint arXiv:2511.16671 (2025)

Guo, Z., Zhang, R., Li, H., Zhang, M., Chen, X., Wang, S., Feng, Y., Pei, P., Heng, P.A.: Thinking-while-generating: Interleaving textual reasoning throughout visual generation. arXiv preprint arXiv:2511.16671 (2025)

work page arXiv 2025
[17]

arXiv preprint arXiv:2501.13926 (2025)

Guo, Z., Zhang, R., Tong, C., Zhao, Z., Huang, R., Zhang, H., Zhang, M., Liu, J., Zhang, S., Gao, P., et al.: Can we generate images with cot? let’s verify and reinforce image generation step by step. arXiv preprint arXiv:2501.13926 (2025)

work page arXiv 2025
[18]

World Models

Ha, D., Schmidhuber, J.: World models. arXiv preprint arXiv:1803.10122 2(3), 440 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[19]

Dream to Control: Learning Behaviors by Latent Imagination

Hafner, D., Lillicrap, T., Ba, J., Norouzi, M.: Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1912
[20]

arXiv preprint arXiv:2601.03193 (2026)

Han, R., Fang, Z., Sun, X., Ma, Y., Wang, Z., Zeng, Y., Chen, Z., Chen, L., Huang, W., Xu, W.J., et al.: Unicorn: Towards self-improving uni- fied multimodal models through self-generated supervision. arXiv preprint arXiv:2601.03193 (2026)

work page arXiv 2026
[21]

In: Advances in Neural Information Processing Systems (2020)

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems (2020)

2020
[22]

Classifier-Free Diffusion Guidance

Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

IEEE Robotics and Automation Letters5(2), 3019–3026 (2020)

James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: Rlbench: The robot learn- ing benchmark & learning environment. IEEE Robotics and Automation Letters5(2), 3019–3026 (2020)

2020
[24]

arXiv preprint arXiv:2505.00703 (2025)

Jiang, D., Guo, Z., Zhang, R., Zong, Z., Li, H., Zhuo, L., Yan, S., Heng, P.A., Li, H.: T2i-r1: Reinforcing image generation with collaborative semantic- level and token-level cot. arXiv preprint arXiv:2505.00703 (2025)

work page arXiv 2025
[25]

arXiv preprint arXiv:2512.05112 (2025)

Jiang, D., Zhang, R., Li, H., Zong, Z., Guo, Z., He, J., Guo, C., Ye, J., Fang, R., Li, W., et al.: Draco: Draft as cot for text-to-image preview and rare concept generation. arXiv preprint arXiv:2512.05112 (2025)

work page arXiv 2025
[26]

arXiv preprint arXiv:2505.17534 (2025)

Jiang, J., Si, C., Luo, J., Zhang, H., Ma, C.: Co-reinforcement learn- ing for unified multimodal understanding and generation. arXiv preprint arXiv:2505.17534 (2025)

work page arXiv 2025
[27]

Advances in neural information process- ing systems35, 26565–26577 (2022)

Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. Advances in neural information process- ing systems35, 26565–26577 (2022)

2022
[28]

Advances in neural information processing systems34, 21696–21707 (2021)

Kingma, D., Salimans, T., Poole, B., Ho, J.: Variational diffusion models. Advances in neural information processing systems34, 21696–21707 (2021)

2021
[29]

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

Li, C., Wu, W., Zhang, H., Xia, Y., Mao, S., Dong, L., Vulić, I., Wei, F.: Imagine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Causal World Modeling for Robot Control

Li, L., Zhang, Q., Luo, Y., Yang, S., Wang, R., Han, F., Yu, M., Gao, Z., Xue, N., Zhu, X., Shen, Y., Xu, Y.: Causal world modeling for robot control. arXiv preprint arXiv:2601.21998 (2026) 26 Z. Yang et al

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Liao, Y., Zhou, P., Huang, S., Yang, D., Chen, S., Jiang, Y., Hu, Y., Cai, J., Liu, S., Luo, J., et al.: Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

In: NeurIPS (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)

2023
[33]

arXiv preprint arXiv:2502.20321 (2025)

Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Unitok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321 (2025)

work page arXiv 2025
[34]

arXiv preprint arXiv:2411.07975 (2024)

Ma, Y., Liu, X., Chen, X., Liu, W., Wu, C., Wu, Z., Pan, Z., Xie, Z., Zhang, H., Zhao, L., et al.: Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. arXiv preprint arXiv:2411.07975 (2024)

work page arXiv 2024
[35]

In: International conference on machine learning

Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic mod- els. In: International conference on machine learning. pp. 8162–8171. PMLR (2021)

2021
[36]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

arXiv preprint arXiv:2510.07313 (2025)

Qian, Z., Chi, X., Li, Y., Wang, S., Qin, Z., Ju, X., Han, S., Zhang, S.: Wristworld: Generating wrist-views via 4d world models for robotic manip- ulation. arXiv preprint arXiv:2510.07313 (2025)

work page arXiv 2025
[38]

arXiv preprint arXiv:2508.05606 (2025)

Qin, L., Gong, J., Sun, Y., Li, T., Yang, M., Yang, X., Qu, C., Tan, Z., Li, H.: Uni-cot: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606 (2025)

work page arXiv 2025
[39]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchi- cal text-conditional image generation with clip latents. arXiv preprint arXiv:2204.061251(2), 3 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[40]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High- resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

2022
[41]

Advances in neural information processing systems35, 36479– 36494 (2022)

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language under- standing. Advances in neural information processing systems35, 36479– 36494 (2022)

2022
[42]

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Shi, Q., Bai, J., Zhao, Z., Chai, W., Yu, K., Wu, J., Song, S., Tong, Y., Li, X., Li, X., et al.: Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model. arXiv:2505.23606 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

In: Proceedings of the 32nd International Conference on Machine Learning

Sohl-Dickstein,J.,Weiss,E.A.,Maheswaranathan,N.,Ganguli,S.:Deepun- supervised learning using nonequilibrium thermodynamics. In: Proceedings of the 32nd International Conference on Machine Learning. pp. 2256–2265 (2015)

2015
[44]

Advances in neural information processing systems34, 1415–1428 (2021) UniCanvas: Diffusion-base Unified Model for Text-in-Image Joint Generation 27

Song, Y., Durkan, C., Murray, I., Ermon, S.: Maximum likelihood training of score-based diffusion models. Advances in neural information processing systems34, 1415–1428 (2021) UniCanvas: Diffusion-base Unified Model for Text-in-Image Joint Generation 27

2021
[45]

Score-Based Generative Modeling through Stochastic Differential Equations

Song,Y.,Sohl-Dickstein,J.,Kingma,D.P.,Kumar,A.,Ermon,S.,Poole,B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2011
[46]

Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

Su, Z., Wei, H., Cen, K., Wang, Y., Chen, G., Yuan, C., Chu, X.: Gen- eration enhances understanding in unified multimodal models via multi- representation generation. arXiv preprint arXiv:2601.21406 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[47]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Team, C.: Chameleon: Mixed-modal early-fusion foundation models, 2024. URL https://arxiv. org/abs/2405.098189(8) (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

Next-Latent Prediction Transformers Learn Compact World Models

Teoh, J., Tomar, M., Ahn, K., Hu, E.S., Sharma, P., Islam, R., Lamb, A., Langford, J.: Next-latent prediction transformers learn compact world models. arXiv preprint arXiv:2511.05963 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

arXiv preprint arXiv:2401.10208 (2024)

Tian, C., Zhu, X., Xiong, Y., Wang, W., Chen, Z., Wang, W., Chen, Y., Lu, L., Lu, T., Zhou, J., et al.: Mm-interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer. arXiv preprint arXiv:2401.10208 (2024)

work page arXiv 2024
[50]

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Tong, S., Fan, D., Zhu, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

arXiv preprint arXiv:2602.01382 (2026)

Wang,F.Y.,Zhang,H.,Gharbi,M.,Li,H.,Park,T.:Promptrl:Promptmat- ters in rl for flow-based image generation. arXiv preprint arXiv:2602.01382 (2026)

work page arXiv 2026
[52]

arXiv:2505.20147 (2025)

Wang, J., Lai, Y., Li, A., Zhang, S., Sun, J., Kang, N., Wu, C., Li, Z., Luo, P.: Fudoki: Discrete flow-based unified understanding and generation via kinetic-optimal velocities. arXiv:2505.20147 (2025)

work page arXiv 2025
[53]

arXiv preprint arXiv:2411.07199 (2024)

Wei, C., Xiong, Z., Ren, W., Du, X., Zhang, G., Chen, W.: Omniedit: Build- ing image editing generalist models through specialist supervision. arXiv preprint arXiv:2411.07199 (2024)

work page arXiv 2024
[54]

arXiv preprint arXiv:2601.19834 (2026)

Wu, J., Zhang, X., Yuan, H., Zhang, X., Huang, T., He, C., Deng, C., Zhang, R., Wu, Y., Long, M.: Visual generation unlocks human-like rea- soning through multimodal world models. arXiv preprint arXiv:2601.19834 (2026)

work page arXiv 2026
[55]

arXiv preprint arXiv:2407.01863 (2024)

Wu, Q., Zhao, H., Saxon, M., Bui, T., Wang, W.Y., Zhang, Y., Chang, S.: Vsp: Assessing the dual challenges of perception and reasoning in spatial planning tasks for vlms. arXiv preprint arXiv:2407.01863 (2024)

work page arXiv 2024
[56]

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Wu, Y., Zhang, Z., Chen, J., Tang, H., Li, D., Fang, Y., Zhu, L., Xie, E., Yin, H., Yi, L., et al.: Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

arXiv preprint arXiv:2505.13031 (2025)

Xiao, Y., Song, L., Chen, Y., Luo, Y., Chen, Y., Gan, Y., Huang, W., Li, X., Qi, X., Shan, Y.: Mindomni: Unleashing reasoning generation in vision language models with rgpo. arXiv preprint arXiv:2505.13031 (2025)

work page arXiv 2025
[58]

In: The Thirteenth International Conference on Learning Representations (2025)

Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify mul- timodal understanding and generation. In: The Thirteenth International Conference on Learning Representations (2025)

2025
[59]

arXiv preprint arXiv:2505.11409 (2025) 28 Z

Xu, Y., Li, C., Zhou, H., Wan, X., Zhang, C., Korhonen, A., Vulić, I.: Visual planning: Let’s think only with images. arXiv preprint arXiv:2505.11409 (2025) 28 Z. Yang et al

work page arXiv 2025
[60]

MMaDA: Multimodal Large Diffusion Language Models

Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., Wang, M.: Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[61]

arXiv preprint arXiv:2507.12508 (2025)

Yang, Y., Liu, J., Zhang, Z., Zhou, S., Tan, R., Yang, J., Du, Y., Gan, C.: Mindjourney: Test-time scaling with world models for spatial reasoning. arXiv preprint arXiv:2507.12508 (2025)

work page arXiv 2025
[62]

Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

Yang, Z., Yu, X., Chen, D., Shen, M., Gan, C.: Machine mental imagery: Empower multimodal reasoning with latent visual tokens. arXiv preprint arXiv:2506.17218 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[63]

Ye, S., Ge, Y., Zheng, K., Gao, S., Yu, S., Kurian, G., Indupuru, S., Tan, Y.L., Zhu, C., Xiang, J., Malik, A., Lee, K., Liang, W., Ranawaka, N., Gu, J., Xu, Y., Wang, G., Hu, F., Narayan, A., Bjorck, J., Wang, J., Kim, G., Niu, D., Zheng, R., Xie, Y., Wu, J., Wang, Q., Julian, R., Xu, D., Du, Y., Chebotar, Y., Reed, S., Kautz, J., Zhu, Y., Fan, L.J., Jan...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[64]

arXiv preprint arXiv:2511.22625 (2025)

Yin, F., Liu, S., Han, Y., Wang, Z., Xing, P., Wang, R., Cheng, W., Wang, Y., Li, A., Yin, Z., et al.: Reasonedit: Towards reasoning-enhanced image editing models. arXiv preprint arXiv:2511.22625 (2025)

work page arXiv 2025
[65]

In: Proceedings of the 33rd ACM International Conference on Mul- timedia

Zeng, B., Yang, L., Liu, J., Xu, M., Zhang, Y., Wan, P., Zhang, W., Yan, S.: Editworld: Simulating world dynamics for instruction-following image editing. In: Proceedings of the 33rd ACM International Conference on Mul- timedia. pp. 12674–12681 (2025)

2025
[66]

In: CVPR (2018)

Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreason- able effectiveness of deep features as a perceptual metric. In: CVPR (2018)

2018
[67]

arXiv preprint arXiv:2602.12322 (2026)

Zhang, Z., Yang, S., Hu, Q., Huang, L.J., Hou, J., Sun, Y., Lu, Y., Han, S.: Foreact: Steering your vla with efficient visual foresight planning. arXiv preprint arXiv:2602.12322 (2026)

work page arXiv 2026
[68]

3D-VLA: A 3D Vision-Language-Action Generative World Model

Zhen, H., Qiu, X., Chen, P., Yang, J., Yan, X., Du, Y., Hong, Y., Gan, C.: 3d-vla: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[69]

arXiv preprint arXiv:2504.20995 (2025)

Zhen, H., Sun, Q., Zhang, H., Li, J., Zhou, S., Du, Y., Gan, C.: Tesser- act: Learning 4d embodied world models. arXiv preprint arXiv:2504.20995 (2025)

work page arXiv 2025
[70]

In: The Thirteenth Inter- national Conference on Learning Representations (2025)

Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., Levy, O.: Transfusion: Predict the next token and diffuse images with one multi-modal model. In: The Thirteenth Inter- national Conference on Learning Representations (2025)

2025

[1] [2]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [3]

In: Proceedings of the Computer Vision and Pattern Recognition Con- ference

Bar, A., Zhou, G., Tran, D., Darrell, T., LeCun, Y.: Navigation world mod- els. In: Proceedings of the Computer Vision and Pattern Recognition Con- ference. pp. 15791–15801 (2025)

2025

[3] [4]

arXiv preprint arXiv:2602.02227 (2026)

Chen, H.H., Yin, X., Shu, W.J., Zhang, H., Zhang, Z., Liao, C., Guo, L., Chen, Q., Chen, Y.C.: Show, don’t tell: Morphing latent reasoning into image generation. arXiv preprint arXiv:2602.02227 (2026)

work page arXiv 2026

[4] [5]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Chen, J., Xu, Z., Pan, X., Hu, Y., Qin, C., Goldstein, T., Huang, L., Zhou, T., Xie, S., Savarese, S., et al.: Blip3-o: A family of fully open uni- fied multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [6]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al.: Pixart: Fast training of diffusion transformer for photore- alistic text-to-image synthesis. arXiv preprint arXiv:2310.00426 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [7]

arXiv preprint arXiv:2407.06135 (2024)

Chern, E., Su, J., Ma, Y., Liu, P.: Anole: An open, autoregressive, na- tive large multimodal models for interleaved image-text generation. arXiv preprint arXiv:2407.06135 (2024)

work page arXiv 2024

[7] [8]

Emerging Properties in Unified Multimodal Pretraining

Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [9]

Advances in neural information processing systems34, 8780–8794 (2021)

Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems34, 8780–8794 (2021)

2021

[9] [10]

arXiv preprint arXiv:2309.11499 (2023)

Dong, R., Han, C., Peng, Y., Qi, Z., Ge, Z., Yang, J., Zhao, L., Sun, J., Zhou, H., Wei, H., et al.: Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499 (2023)

work page arXiv 2023

[10] [11]

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Duan, C., Fang, R., Wang, Y., Wang, K., Huang, L., Zeng, X., Li, H., Liu, X.: Got-r1: Unleashing reasoning capability of mllm for visual generation with reinforcement learning. arXiv preprint arXiv:2505.17022 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [12]

In: Forty-first international confer- ence on machine learning (2024)

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transform- ers for high-resolution image synthesis. In: Forty-first international confer- ence on machine learning (2024)

2024

[12] [13]

arXiv preprint arXiv:2405.05945 (2024) UniCanvas: Diffusion-base Unified Model for Text-in-Image Joint Generation 25

Gao, P., Zhuo, L., Liu, D., Du, R., Luo, X., Qiu, L., Zhang, Y., Lin, C., Huang, R., Geng, S., et al.: Lumina-t2x: Transforming text into any modal- ity, resolution, and duration via flow-based large diffusion transformers. arXiv preprint arXiv:2405.05945 (2024) UniCanvas: Diffusion-base Unified Model for Text-in-Image Joint Generation 25

work page arXiv 2024

[13] [14]

arXiv preprint arXiv:2503.18938 (2025)

Gao, S., Zhou, S., Du, Y., Zhang, J., Gan, C.: Adaworld: Learning adaptable world models with latent actions. arXiv preprint arXiv:2503.18938 (2025)

work page arXiv 2025

[14] [15]

ThinkMorph: Emergent properties in multimodal interleaved chain-of-thought reasoning.arXiv preprint arXiv:2510.27492, 2025

Gu, J., Hao, Y., Wang, H.W., Li, L., Shieh, M.Q., Choi, Y., Krishna, R., Cheng, Y.: Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. arXiv preprint arXiv:2510.27492 (2025)

work page arXiv 2025

[15] [16]

arXiv preprint arXiv:2511.16671 (2025)

Guo, Z., Zhang, R., Li, H., Zhang, M., Chen, X., Wang, S., Feng, Y., Pei, P., Heng, P.A.: Thinking-while-generating: Interleaving textual reasoning throughout visual generation. arXiv preprint arXiv:2511.16671 (2025)

work page arXiv 2025

[16] [17]

arXiv preprint arXiv:2501.13926 (2025)

Guo, Z., Zhang, R., Tong, C., Zhao, Z., Huang, R., Zhang, H., Zhang, M., Liu, J., Zhang, S., Gao, P., et al.: Can we generate images with cot? let’s verify and reinforce image generation step by step. arXiv preprint arXiv:2501.13926 (2025)

work page arXiv 2025

[17] [18]

World Models

Ha, D., Schmidhuber, J.: World models. arXiv preprint arXiv:1803.10122 2(3), 440 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[18] [19]

Dream to Control: Learning Behaviors by Latent Imagination

Hafner, D., Lillicrap, T., Ba, J., Norouzi, M.: Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1912

[19] [20]

arXiv preprint arXiv:2601.03193 (2026)

Han, R., Fang, Z., Sun, X., Ma, Y., Wang, Z., Zeng, Y., Chen, Z., Chen, L., Huang, W., Xu, W.J., et al.: Unicorn: Towards self-improving uni- fied multimodal models through self-generated supervision. arXiv preprint arXiv:2601.03193 (2026)

work page arXiv 2026

[20] [21]

In: Advances in Neural Information Processing Systems (2020)

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems (2020)

2020

[21] [22]

Classifier-Free Diffusion Guidance

Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[22] [23]

IEEE Robotics and Automation Letters5(2), 3019–3026 (2020)

James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: Rlbench: The robot learn- ing benchmark & learning environment. IEEE Robotics and Automation Letters5(2), 3019–3026 (2020)

2020

[23] [24]

arXiv preprint arXiv:2505.00703 (2025)

Jiang, D., Guo, Z., Zhang, R., Zong, Z., Li, H., Zhuo, L., Yan, S., Heng, P.A., Li, H.: T2i-r1: Reinforcing image generation with collaborative semantic- level and token-level cot. arXiv preprint arXiv:2505.00703 (2025)

work page arXiv 2025

[24] [25]

arXiv preprint arXiv:2512.05112 (2025)

Jiang, D., Zhang, R., Li, H., Zong, Z., Guo, Z., He, J., Guo, C., Ye, J., Fang, R., Li, W., et al.: Draco: Draft as cot for text-to-image preview and rare concept generation. arXiv preprint arXiv:2512.05112 (2025)

work page arXiv 2025

[25] [26]

arXiv preprint arXiv:2505.17534 (2025)

Jiang, J., Si, C., Luo, J., Zhang, H., Ma, C.: Co-reinforcement learn- ing for unified multimodal understanding and generation. arXiv preprint arXiv:2505.17534 (2025)

work page arXiv 2025

[26] [27]

Advances in neural information process- ing systems35, 26565–26577 (2022)

Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. Advances in neural information process- ing systems35, 26565–26577 (2022)

2022

[27] [28]

Advances in neural information processing systems34, 21696–21707 (2021)

Kingma, D., Salimans, T., Poole, B., Ho, J.: Variational diffusion models. Advances in neural information processing systems34, 21696–21707 (2021)

2021

[28] [29]

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

Li, C., Wu, W., Zhang, H., Xia, Y., Mao, S., Dong, L., Vulić, I., Wei, F.: Imagine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [30]

Causal World Modeling for Robot Control

Li, L., Zhang, Q., Luo, Y., Yang, S., Wang, R., Han, F., Yu, M., Gao, Z., Xue, N., Zhu, X., Shen, Y., Xu, Y.: Causal world modeling for robot control. arXiv preprint arXiv:2601.21998 (2026) 26 Z. Yang et al

work page internal anchor Pith review Pith/arXiv arXiv 2026

[30] [31]

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Liao, Y., Zhou, P., Huang, S., Yang, D., Chen, S., Jiang, Y., Hu, Y., Cai, J., Liu, S., Luo, J., et al.: Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [32]

In: NeurIPS (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)

2023

[32] [33]

arXiv preprint arXiv:2502.20321 (2025)

Ma, C., Jiang, Y., Wu, J., Yang, J., Yu, X., Yuan, Z., Peng, B., Qi, X.: Unitok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321 (2025)

work page arXiv 2025

[33] [34]

arXiv preprint arXiv:2411.07975 (2024)

Ma, Y., Liu, X., Chen, X., Liu, W., Wu, C., Wu, Z., Pan, Z., Xie, Z., Zhang, H., Zhao, L., et al.: Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. arXiv preprint arXiv:2411.07975 (2024)

work page arXiv 2024

[34] [35]

In: International conference on machine learning

Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic mod- els. In: International conference on machine learning. pp. 8162–8171. PMLR (2021)

2021

[35] [36]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[36] [37]

arXiv preprint arXiv:2510.07313 (2025)

Qian, Z., Chi, X., Li, Y., Wang, S., Qin, Z., Ju, X., Han, S., Zhang, S.: Wristworld: Generating wrist-views via 4d world models for robotic manip- ulation. arXiv preprint arXiv:2510.07313 (2025)

work page arXiv 2025

[37] [38]

arXiv preprint arXiv:2508.05606 (2025)

Qin, L., Gong, J., Sun, Y., Li, T., Yang, M., Yang, X., Qu, C., Tan, Z., Li, H.: Uni-cot: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606 (2025)

work page arXiv 2025

[38] [39]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchi- cal text-conditional image generation with clip latents. arXiv preprint arXiv:2204.061251(2), 3 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[39] [40]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High- resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

2022

[40] [41]

Advances in neural information processing systems35, 36479– 36494 (2022)

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language under- standing. Advances in neural information processing systems35, 36479– 36494 (2022)

2022

[41] [42]

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Shi, Q., Bai, J., Zhao, Z., Chai, W., Yu, K., Wu, J., Song, S., Tong, Y., Li, X., Li, X., et al.: Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model. arXiv:2505.23606 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [43]

In: Proceedings of the 32nd International Conference on Machine Learning

Sohl-Dickstein,J.,Weiss,E.A.,Maheswaranathan,N.,Ganguli,S.:Deepun- supervised learning using nonequilibrium thermodynamics. In: Proceedings of the 32nd International Conference on Machine Learning. pp. 2256–2265 (2015)

2015

[43] [44]

Advances in neural information processing systems34, 1415–1428 (2021) UniCanvas: Diffusion-base Unified Model for Text-in-Image Joint Generation 27

Song, Y., Durkan, C., Murray, I., Ermon, S.: Maximum likelihood training of score-based diffusion models. Advances in neural information processing systems34, 1415–1428 (2021) UniCanvas: Diffusion-base Unified Model for Text-in-Image Joint Generation 27

2021

[44] [45]

Score-Based Generative Modeling through Stochastic Differential Equations

Song,Y.,Sohl-Dickstein,J.,Kingma,D.P.,Kumar,A.,Ermon,S.,Poole,B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2011

[45] [46]

Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

Su, Z., Wei, H., Cen, K., Wang, Y., Chen, G., Yuan, C., Chu, X.: Gen- eration enhances understanding in unified multimodal models via multi- representation generation. arXiv preprint arXiv:2601.21406 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[46] [47]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Team, C.: Chameleon: Mixed-modal early-fusion foundation models, 2024. URL https://arxiv. org/abs/2405.098189(8) (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [48]

Next-Latent Prediction Transformers Learn Compact World Models

Teoh, J., Tomar, M., Ahn, K., Hu, E.S., Sharma, P., Islam, R., Lamb, A., Langford, J.: Next-latent prediction transformers learn compact world models. arXiv preprint arXiv:2511.05963 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [49]

arXiv preprint arXiv:2401.10208 (2024)

Tian, C., Zhu, X., Xiong, Y., Wang, W., Chen, Z., Wang, W., Chen, Y., Lu, L., Lu, T., Zhou, J., et al.: Mm-interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer. arXiv preprint arXiv:2401.10208 (2024)

work page arXiv 2024

[49] [50]

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Tong, S., Fan, D., Zhu, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [51]

arXiv preprint arXiv:2602.01382 (2026)

Wang,F.Y.,Zhang,H.,Gharbi,M.,Li,H.,Park,T.:Promptrl:Promptmat- ters in rl for flow-based image generation. arXiv preprint arXiv:2602.01382 (2026)

work page arXiv 2026

[51] [52]

arXiv:2505.20147 (2025)

Wang, J., Lai, Y., Li, A., Zhang, S., Sun, J., Kang, N., Wu, C., Li, Z., Luo, P.: Fudoki: Discrete flow-based unified understanding and generation via kinetic-optimal velocities. arXiv:2505.20147 (2025)

work page arXiv 2025

[52] [53]

arXiv preprint arXiv:2411.07199 (2024)

Wei, C., Xiong, Z., Ren, W., Du, X., Zhang, G., Chen, W.: Omniedit: Build- ing image editing generalist models through specialist supervision. arXiv preprint arXiv:2411.07199 (2024)

work page arXiv 2024

[53] [54]

arXiv preprint arXiv:2601.19834 (2026)

Wu, J., Zhang, X., Yuan, H., Zhang, X., Huang, T., He, C., Deng, C., Zhang, R., Wu, Y., Long, M.: Visual generation unlocks human-like rea- soning through multimodal world models. arXiv preprint arXiv:2601.19834 (2026)

work page arXiv 2026

[54] [55]

arXiv preprint arXiv:2407.01863 (2024)

Wu, Q., Zhao, H., Saxon, M., Bui, T., Wang, W.Y., Zhang, Y., Chang, S.: Vsp: Assessing the dual challenges of perception and reasoning in spatial planning tasks for vlms. arXiv preprint arXiv:2407.01863 (2024)

work page arXiv 2024

[55] [56]

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Wu, Y., Zhang, Z., Chen, J., Tang, H., Li, D., Fang, Y., Zhu, L., Xie, E., Yin, H., Yi, L., et al.: Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[56] [57]

arXiv preprint arXiv:2505.13031 (2025)

Xiao, Y., Song, L., Chen, Y., Luo, Y., Chen, Y., Gan, Y., Huang, W., Li, X., Qi, X., Shan, Y.: Mindomni: Unleashing reasoning generation in vision language models with rgpo. arXiv preprint arXiv:2505.13031 (2025)

work page arXiv 2025

[57] [58]

In: The Thirteenth International Conference on Learning Representations (2025)

Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify mul- timodal understanding and generation. In: The Thirteenth International Conference on Learning Representations (2025)

2025

[58] [59]

arXiv preprint arXiv:2505.11409 (2025) 28 Z

Xu, Y., Li, C., Zhou, H., Wan, X., Zhang, C., Korhonen, A., Vulić, I.: Visual planning: Let’s think only with images. arXiv preprint arXiv:2505.11409 (2025) 28 Z. Yang et al

work page arXiv 2025

[59] [60]

MMaDA: Multimodal Large Diffusion Language Models

Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., Wang, M.: Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[60] [61]

arXiv preprint arXiv:2507.12508 (2025)

Yang, Y., Liu, J., Zhang, Z., Zhou, S., Tan, R., Yang, J., Du, Y., Gan, C.: Mindjourney: Test-time scaling with world models for spatial reasoning. arXiv preprint arXiv:2507.12508 (2025)

work page arXiv 2025

[61] [62]

Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

Yang, Z., Yu, X., Chen, D., Shen, M., Gan, C.: Machine mental imagery: Empower multimodal reasoning with latent visual tokens. arXiv preprint arXiv:2506.17218 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[62] [63]

Ye, S., Ge, Y., Zheng, K., Gao, S., Yu, S., Kurian, G., Indupuru, S., Tan, Y.L., Zhu, C., Xiang, J., Malik, A., Lee, K., Liang, W., Ranawaka, N., Gu, J., Xu, Y., Wang, G., Hu, F., Narayan, A., Bjorck, J., Wang, J., Kim, G., Niu, D., Zheng, R., Xie, Y., Wu, J., Wang, Q., Julian, R., Xu, D., Du, Y., Chebotar, Y., Reed, S., Kautz, J., Zhu, Y., Fan, L.J., Jan...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[63] [64]

arXiv preprint arXiv:2511.22625 (2025)

Yin, F., Liu, S., Han, Y., Wang, Z., Xing, P., Wang, R., Cheng, W., Wang, Y., Li, A., Yin, Z., et al.: Reasonedit: Towards reasoning-enhanced image editing models. arXiv preprint arXiv:2511.22625 (2025)

work page arXiv 2025

[64] [65]

In: Proceedings of the 33rd ACM International Conference on Mul- timedia

Zeng, B., Yang, L., Liu, J., Xu, M., Zhang, Y., Wan, P., Zhang, W., Yan, S.: Editworld: Simulating world dynamics for instruction-following image editing. In: Proceedings of the 33rd ACM International Conference on Mul- timedia. pp. 12674–12681 (2025)

2025

[65] [66]

In: CVPR (2018)

Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreason- able effectiveness of deep features as a perceptual metric. In: CVPR (2018)

2018

[66] [67]

arXiv preprint arXiv:2602.12322 (2026)

Zhang, Z., Yang, S., Hu, Q., Huang, L.J., Hou, J., Sun, Y., Lu, Y., Han, S.: Foreact: Steering your vla with efficient visual foresight planning. arXiv preprint arXiv:2602.12322 (2026)

work page arXiv 2026

[67] [68]

3D-VLA: A 3D Vision-Language-Action Generative World Model

Zhen, H., Qiu, X., Chen, P., Yang, J., Yan, X., Du, Y., Hong, Y., Gan, C.: 3d-vla: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[68] [69]

arXiv preprint arXiv:2504.20995 (2025)

Zhen, H., Sun, Q., Zhang, H., Li, J., Zhou, S., Du, Y., Gan, C.: Tesser- act: Learning 4d embodied world models. arXiv preprint arXiv:2504.20995 (2025)

work page arXiv 2025

[69] [70]

In: The Thirteenth Inter- national Conference on Learning Representations (2025)

Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., Levy, O.: Transfusion: Predict the next token and diffuse images with one multi-modal model. In: The Thirteenth Inter- national Conference on Learning Representations (2025)

2025