arxiv: 2606.29013 · v1 · pith:B6FCSUC3new · submitted 2026-06-27 · 💻 cs.CV

Mural: Transferring LLM knowledge to image generation via Mixture-of-Transformers

Pith reviewed 2026-06-30 09:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image synthesis · large language models · mixture of transformers · diffusion models · emergent capabilities · multimodal learning · frozen models · shared attention

The pith

A frozen LLM transfers its knowledge to guide image generation when attention is shared with a diffusion model in Mixture-of-Transformers, using only standard text-image pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a pretrained large language model's intrinsic knowledge remains useful for directing text-to-image synthesis when the LLM stays frozen and training uses only ordinary text-image pairs. It connects the LLM to a diffusion generator through shared attention inside the Mixture-of-Transformers architecture. Experiments measure how much of the LLM's reasoning ability carries over and what new behaviors appear in the combined system. The results include competitive scores on standard benchmarks plus unexpected outputs such as cross-lingual generation and scenes built from emoji or ASCII art.

Core claim

Pretrained LLM knowledge can guide image synthesis under standard text-to-image training paradigms, without interleaved multimodal signals or explicit reasoning supervision, when a frozen reasoning-capable LLM is integrated with a diffusion-based image generator via shared attention within the Mixture-of-Transformers architecture; this produces strong benchmark performance and emergent behaviors absent from the training data.

What carries the argument

Mixture-of-Transformers architecture with shared attention, which keeps the frozen LLM's parameters active and able to shape the diffusion process during training on text-image pairs.
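
A minimal sketch of the shared-attention idea, in plain Python, may make this concrete. It is not the paper's implementation: the function names, the tiny dimensions, and the random weights are all assumptions for illustration, and it omits multi-head attention, masking, feed-forward layers, and normalization. The point it shows is that each token is projected with its own modality's weights (a "text" expert standing in for the frozen LLM, an "image" expert standing in for the trainable generator), while attention runs over the whole mixed sequence.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def make_expert(dim, rng):
    """One modality-specific set of Q/K/V projections (hypothetical layout)."""
    return {name: random_matrix(dim, dim, rng) for name in ("q", "k", "v")}

def shared_attention(tokens, modalities, experts):
    """Project each token with its modality's weights, then attend jointly.

    tokens: list of vectors; modalities: parallel list of "text" or "image".
    """
    dim = len(tokens[0])
    q = [matvec(experts[m]["q"], t) for t, m in zip(tokens, modalities)]
    k = [matvec(experts[m]["k"], t) for t, m in zip(tokens, modalities)]
    v = [matvec(experts[m]["v"], t) for t, m in zip(tokens, modalities)]
    outputs = []
    for qi in q:
        # Scores cover the full mixed sequence, so an image token can read
        # keys and values produced by the text expert's projections.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) for kj in k]
        weights = softmax(scores)
        outputs.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(dim)])
    return outputs

rng = random.Random(0)
dim = 4
# In the setup described above, only the "image" expert would receive
# gradient updates; the "text" expert stands in for the frozen LLM.
experts = {"text": make_expert(dim, rng), "image": make_expert(dim, rng)}
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
modalities = ["text", "text", "text", "image", "image"]
out = shared_attention(tokens, modalities, experts)
print(len(out), len(out[0]))
```

Because the image-token outputs are weighted sums over text-token values, changing a text token changes what the image positions compute, which is the channel the pith describes as carrying the LLM's knowledge.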

If this is right

  • The models reach 0.85 on GenEval and 86.75 on DPG-Bench, and 0.66 on WISE with inference-time reasoning.
  • Behaviors absent from training data emerge, including cross-lingual image generation, color-guided composition, and emoji or ASCII scene construction.
  • Generation can be steered by the LLM's world knowledge without explicit supervision.
  • The LLM's intrinsic knowledge remains accessible during ordinary text-to-image training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Shared-attention designs may allow knowledge transfer across other modality pairs without large aligned datasets.
  • The results suggest that explicit multimodal pretraining is not always required for aligned generation capabilities.
  • Scaling the size of the frozen LLM could strengthen the observed emergent behaviors.
  • The same mechanism might support efficient multimodal extensions in resource-limited settings.

Load-bearing premise

Shared attention between the frozen LLM and the image generator keeps the LLM's knowledge accessible and transferable without any extra multimodal data or supervision.

What would settle it

An ablation that removes the shared-attention links while keeping the LLM frozen. If the premise holds, the reported benchmark gains and emergent behaviors should disappear.
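
One way to picture such an ablation is as an attention mask with the image-to-text entries switched off. The sketch below is an assumption-laden illustration, not the paper's setup: the token layout (text first, then image), the causal text-to-text pattern, and the `shared` flag are all hypothetical choices made here.

```python
def build_attention_mask(n_text, n_image, shared=True):
    """Return mask[i][j], True when query position i may attend to key j.

    Assumed layout: n_text text tokens followed by n_image image tokens.
    """
    n = n_text + n_image
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            i_text, j_text = i < n_text, j < n_text
            if i_text and j_text:
                mask[i][j] = j <= i      # causal text-to-text, decoder style
            elif not i_text and not j_text:
                mask[i][j] = True        # image tokens see each other
            elif not i_text and j_text:
                mask[i][j] = shared      # the cross-modal link under test
            # text-to-image stays False so the frozen text states are unchanged
    return mask

def show(mask):
    """Render each mask row as a string: 'x' allowed, '.' blocked."""
    return ["".join("x" if allowed else "." for allowed in row) for row in mask]

with_links = build_attention_mask(n_text=2, n_image=2, shared=True)
ablated = build_attention_mask(n_text=2, n_image=2, shared=False)
for left, right in zip(show(with_links), show(ablated)):
    print(left, "   ", right)
```

With `shared=False` the image rows lose every text column, so in this toy version the generator receives no prompt information at all. A real ablation would have to say how text conditioning is supplied instead, which this sketch leaves open.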

Figures

Figures reproduced from arXiv: 2606.29013 by Achin Jain, Davide Modolo, Jie An, Siddharth Chaudhary.

Figure 1
Mural architecture overview. The LLM is frozen and only the image generation expert is trained with a flow matching loss.
… parameters remain fixed while a trainable image generation expert learns to synthesize images conditioned on representations produced by the LLM. In our experiments, we instantiate the frozen backbone with instruction-tuned Qwen2.5 [34] or Qwen2.5-VL [29]. This design allows us to lever…
Figure 2
Cross-lingual generation: Mural generates images from prompts in English, Hindi, Chinese, and Italian despite training only on English image-text pairs. The frozen LLM’s multilingual understanding transfers to the generation branch via shared attention without using CoT at inference.
Dense image generation expert. To better understand the role of shared attention in Mural, we train a dense baseline consist…
Figure 3
Color palette guidance: Given hex color codes (shown above each row), we compare how well each model adheres to the specified color specifications. Mural better respects the color constraints compared to OmniGen2 and Bagel.
… generation results on English, Hindi, Chinese, and Italian prompts, demonstrating consistent quality across languages. English and Chinese prompts are taken from [7]. Most notably, we…
Figure 4
Emoji scene composition: All prompts use the prefix “Convert this ASCII art to a realistic image: ” followed by the emoji sequence shown in the first column. Mural produces realistic interpretations even without CoT, with quality improving further when CoT is enabled.
… and symbolic representations. We show more qualitative examples with different number of emojis. In row 2, Mural correctly infers the color …
Figure 5
Textual layout interpretation: All prompts use the prefix “Convert this drawing to a realistic image: ” followed by textual layouts in ASCII format (optionally with emoji) shown in the left column. Mural interprets textual layouts and generates corresponding realistic images.
… or the arrangement of a cup of coffee, a book, and flowers on a table (row 2). Results improve further with CoT enabled at inferenc…
Figure 6
Reasoning-based on world knowledge: When direct generation without CoT produces incorrect results, prompts requiring world knowledge are resolved correctly when CoT thinking is enabled at inference.
Figure 7
GenEval scores across training steps for different model scales and initializations. Qwen2.5-VL initialization provides faster convergence at 3B scale, but this advantage diminishes at 7B.
Original abstract

Leveraging capabilities of large language models (LLMs) in text-to-image (T2I) synthesis is an important research direction. In this work we investigate whether the knowledge of a frozen LLM can be effectively utilized in T2I generation when trained exclusively on standard text-image pairs. We integrate a frozen, reasoning-capable LLM with a diffusion-based image generator via shared attention within the Mixture-of-Transformers (MoT) architecture. Our experiments span two critical questions: (1) what degree of the LLM's intrinsic knowledge remains accessible during T2I training, and (2) what novel capabilities emerge in the resulting system. Across established benchmarks, our models achieve strong performance among unified understanding-generation systems: 0.85 on GenEval, 86.75 on DPG-Bench, and 0.66 on WISE with inference-time reasoning, using only text-image data. Remarkably, we uncover emergent behaviors absent from training data, including cross-lingual image generation, color-guided composition, emoji / ASCII scene construction, and generation directed by world knowledge. These results demonstrate that pretrained LLM knowledge can guide image synthesis under standard text-to-image training paradigms, without interleaved multimodal signals or explicit reasoning supervision. Our findings open new avenues for harnessing frozen model capabilities in resource-constrained multimodal learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Mural, which integrates a frozen reasoning-capable LLM into a diffusion-based T2I generator via shared attention in the Mixture-of-Transformers (MoT) architecture. Trained exclusively on standard text-image pairs (no interleaved multimodal data or explicit reasoning supervision), it reports strong benchmark results among unified systems (0.85 GenEval, 86.75 DPG-Bench, 0.66 WISE with inference-time reasoning) and claims emergent capabilities including cross-lingual generation, color-guided composition, emoji/ASCII scene construction, and world-knowledge-directed outputs. The central thesis is that pretrained LLM knowledge remains accessible and useful for guiding image synthesis under these constraints.

Significance. If the isolation of LLM knowledge transfer holds, the result would be significant for multimodal learning: it would demonstrate that frozen LLM capabilities can be transferred to generation tasks using only text-image pairs and shared attention, without the cost of multimodal pretraining or interleaved signals. This could enable more resource-efficient unified models and explain emergent behaviors arising from joint training. The reported benchmark numbers and listed emergent behaviors, if robust, would support broader claims about leveraging existing model knowledge in constrained settings.

major comments (2)
  1. [Abstract / experimental setup] The central claim that 'pretrained LLM knowledge can guide image synthesis' via shared attention in MoT requires evidence that the gains derive from the LLM's intrinsic (pretrained) knowledge rather than from joint MoT training with any sufficiently expressive text encoder. No ablation is described that replaces the frozen pretrained LLM with a randomly initialized transformer of matched capacity while holding the MoT architecture, shared attention, and diffusion training fixed. Without this control, the benchmark scores and emergent behaviors remain compatible with the alternative that any expressive text encoder integrated via shared attention would yield similar joint-training effects.
  2. [Abstract] The reported scores (0.85 GenEval, 86.75 DPG-Bench, 0.66 WISE) and emergent behaviors are presented as evidence for the accessibility of LLM knowledge, yet the description contains no controls, baseline comparisons, or error analysis that would allow attribution to the frozen LLM component specifically. This is load-bearing for the two critical questions posed in the abstract.
minor comments (1)
  1. [Abstract] The abstract states results 'with inference-time reasoning' for the WISE score but does not specify the exact mechanism or whether it relies on the LLM component.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback emphasizing the need for stronger controls to attribute results specifically to pretrained LLM knowledge. We address the major comments point by point below.

Point-by-point responses
  1. Referee: [Abstract / experimental setup] The central claim that 'pretrained LLM knowledge can guide image synthesis' via shared attention in MoT requires evidence that the gains derive from the LLM's intrinsic (pretrained) knowledge rather than from joint MoT training with any sufficiently expressive text encoder. No ablation is described that replaces the frozen pretrained LLM with a randomly initialized transformer of matched capacity while holding the MoT architecture, shared attention, and diffusion training fixed. Without this control, the benchmark scores and emergent behaviors remain compatible with the alternative that any expressive text encoder integrated via shared attention would yield similar joint-training effects.

    Authors: We agree that an ablation replacing the frozen pretrained LLM with a randomly initialized transformer of matched capacity would more conclusively isolate the contribution of pretraining. Our experiments demonstrate that the frozen LLM enables strong benchmark performance and emergent behaviors under standard text-image training, but we acknowledge this does not rule out similar effects from any expressive encoder. We will add a limitations discussion noting this gap and the computational cost of such controls. revision: partial

  2. Referee: [Abstract] The reported scores (0.85 GenEval, 86.75 DPG-Bench, 0.66 WISE) and emergent behaviors are presented as evidence for the accessibility of LLM knowledge, yet the description contains no controls, baseline comparisons, or error analysis that would allow attribution to the frozen LLM component specifically. This is load-bearing for the two critical questions posed in the abstract.

    Authors: The abstract frames the results as arising from integration of a frozen reasoning-capable LLM. We will revise the abstract and main text to more precisely qualify the claims, explicitly note the absence of random-initialization controls, and add a brief discussion of alternative explanations consistent with joint training effects. revision: partial

standing simulated objections (unresolved)
  • Absence of an ablation replacing the pretrained LLM with a randomly initialized transformer of matched capacity (requires new large-scale experiments beyond current resources).

Circularity Check

0 steps flagged

No circularity: empirical claims rest on experimental outcomes, not definitional reductions

full rationale

The manuscript describes an architecture (MoT with shared attention between frozen LLM and diffusion generator) and reports benchmark scores plus emergent behaviors from training on text-image pairs. No equations, parameter-fitting steps presented as predictions, or load-bearing self-citations appear in the provided text. The central claim—that pretrained LLM knowledge remains accessible and useful—is framed as an empirical finding rather than a derivation that reduces to its own inputs by construction. This is the normal case of a self-contained experimental paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is limited to the abstract; no explicit free parameters, axioms, or invented entities are identifiable from the given information.

pith-pipeline@v0.9.1-grok · 5771 in / 1011 out tokens · 29947 ms · 2026-06-30T09:20:48.051869+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 20 canonical work pages · 20 internal anchors

  1. [1]

    In: CVPR (2023)

    Bao, F., Nie, S., Xue, K., Cao, Y., Li, C., Su, H., Zhu, J.: All are worth words: A ViT backbone for diffusion models. In: CVPR (2023)

  2. [2]

    Computer Science (2023)

    Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al.: Improving image generation with better captions. Computer Science (2023)

  3. [3]

    https://blackforestlabs.ai/ (2024)

    Black Forest Labs: FLUX.1. https://blackforestlabs.ai/ (2024)

  4. [4]

    HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

    Cai, Q., Chen, J., Chen, Y., Li, Y., Long, F., Pan, Y., Qiu, Z., Zhang, Y., Gao, F., Xu, P., et al.: HiDream-I1: A high-efficient image generative foundation model with sparse diffusion transformer. arXiv preprint arXiv:2505.22705 (2025)

  5. [5]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024)

  6. [6]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025)

  7. [7]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

  8. [8]

    In: ICML (2024)

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024)

  9. [9]

    Seedream 3.0 Technical Report

    Gao, Y., Gong, L., Guo, Q., Hou, X., Lai, Z., Li, F., Li, L., Lian, X., Liao, C., Liu, L., et al.: Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346 (2025)

  10. [10]

    NeurIPS (2023)

    Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. NeurIPS (2023)

  11. [11]

    In: CVPR (2025)

    Han, J., Liu, J., Jiang, Y., Yan, B., Zhang, Y., Yuan, Z., Peng, B., Liu, X.: Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In: CVPR (2025)

  12. [12]

    NeurIPS (2020)

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS (2020)

  13. [13]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Hu, X., Wang, R., Fang, Y., Fu, B., Cheng, P., Yu, G.: ELLA: Equip diffusion models with LLM for enhanced semantic alignment. arXiv preprint arXiv:2403.05135 (2024)

  14. [14]

    In: NeurIPS (2024)

    Li, T., Tian, Y., Li, H., Deng, M., He, K.: Autoregressive image generation without vector quantization. In: NeurIPS (2024)

  15. [15]

    Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

    Liang, W., Yu, L., Luo, L., Iyer, S., Dong, N., Zhou, C., Ghosh, G., Lewis, M., Yih, W.t., Zettlemoyer, L., et al.: Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996 (2024)

  16. [16]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)

  17. [17]

    In: ICLR (2023)

    Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: ICLR (2023)

  18. [18]

    In: ICLR (2019)

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)

  19. [19]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Niu, Y., Ning, M., Zheng, M., Jin, W., Lin, B., Jin, P., Liao, J., Feng, C., Ning, K., Zhu, B., et al.: WISE: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265 (2025)

  20. [20]

    Cosmos 3: Omnimodal World Models for Physical AI

    Nvidia: Cosmos 3: Omnimodal World Models for Physical AI. arXiv preprint arXiv:2606.02800 (2026)

  21. [21]

    https://openai.com/index/hello-gpt-4o/ (2024)

    OpenAI: GPT-4o. https://openai.com/index/hello-gpt-4o/ (2024)

  22. [22]

    Transfer between Modalities with MetaQueries

    Pan, X., Shukla, S.N., Singh, A., Zhao, Z., Mishra, S.K., Wang, J., Xu, Z., Chen, J., Li, K., Juefei-Xu, F., et al.: Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256 (2025)

  23. [23]

    In: ICCV (2023)

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV (2023)

  24. [24]

    In: ICLR (2024)

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis. In: ICLR (2024)

  25. [25]

    In: CVPR (2022)

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

  26. [26]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.: Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)

  27. [27]

    In: NeurIPS (2025)

    Shi, W., Han, X., Zhou, C., Liang, W., Lin, X., Zettlemoyer, L., Yu, L.: LMFusion: Adapting pretrained language models for multimodal generation. In: NeurIPS (2025)

  28. [28]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., Yuan, Z.: Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525 (2024)

  29. [29]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

  30. [30]

    Emu3: Next-Token Prediction is All You Need

    Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024)

  31. [31]

    Qwen-Image Technical Report

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...

  32. [32]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)

  33. [33]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528 (2024)

  34. [34]

    Qwen2.5 Technical Report

    Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al.: Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2024)

  35. [35]

    NeurIPS (2019)

    Zhang, B., Sennrich, R.: Root mean square layer normalization. NeurIPS (2019)

  36. [36]

    Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., Levy, O.: Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039 (2024)