arxiv: 2605.16961 · v1 · pith:EL2CHLBO · submitted 2026-05-16 · 💻 cs.CV · cs.AI

Latent Action Control for Reasoning-Guided Unified Image Generation

Pith reviewed 2026-05-19 20:36 UTC · model grok-4.3

keywords latent action control · unified multimodal models · reasoning-guided image generation · compositional generation · latent trajectories · flow-based generation · attribute binding · spatial relations

The pith

Latent Action Control turns inferred reasoning into hidden continuous actions that guide image generation inside unified models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Unified multimodal models often understand prompts well but fail to apply that understanding when generating images. Latent Action Control addresses this by representing reasoning steps as unobserved latent action trajectories that get injected directly into the generator's hidden conditioning stream. These trajectories handle planning, internal drafting, diagnosis, and refinement without emitting any visible reasoning tokens or intermediate pictures. The method learns the actions from semantic priors and visual feedback using variational alignment followed by reinforcement learning, then demonstrates gains on compositional and knowledge-grounded benchmarks.

Core claim

Latent Action Control (LAC) makes reasoning actionable inside a unified generator by rolling out role-structured latent trajectories for planning, internal visual drafting, diagnosis, and refinement, then injecting these actions into the hidden stream that conditions flow-based image generation. The trajectories remain unobserved during inference and are instead learned through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. This supplies a direct control path from inferred relations, bindings, and knowledge cues to the generation process.

What carries the argument

Latent action trajectories: role-structured hidden continuous actions for planning, drafting, diagnosis, and refinement that are injected into the generator's conditioning stream without producing explicit tokens or images.
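To make the mechanism concrete, here is a minimal sketch of what injecting a role-structured latent trajectory into a conditioning stream could look like. The four role names follow the paper's description; everything else is an assumption made for illustration: the helper names `rollout_trajectory` and `inject_latent_actions`, the shapes, the number of steps per role, and the choice of concatenation along the sequence axis. In LAC the actions are learned, whereas here they are random placeholders.

```python
import random

ROLES = ("plan", "draft", "diagnose", "refine")  # the four hidden roles named in the paper

def rollout_trajectory(rng, steps_per_role, dim):
    """Hypothetical stand-in for the learned rollout: one block of
    continuous latent actions per role (random here, learned in LAC)."""
    return {
        role: [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(steps_per_role)]
        for role in ROLES
    }

def inject_latent_actions(prompt_states, trajectory):
    """Append the latent actions to the prompt hidden states in role order.
    Assumption: injection is plain concatenation along the sequence axis."""
    actions = [vec for role in ROLES for vec in trajectory[role]]
    return list(prompt_states) + actions

rng = random.Random(0)
prompt_states = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(5)]  # 5 prompt tokens, width 16
trajectory = rollout_trajectory(rng, steps_per_role=2, dim=16)
conditioning = inject_latent_actions(prompt_states, trajectory)

# 5 prompt states + 4 roles * 2 actions = 13 conditioning vectors of width 16
print(len(conditioning), len(conditioning[0]))
```

No text tokens or images appear anywhere in this path: the generator would only ever see the extended list of hidden vectors.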

If this is right

  • Unified generators achieve stronger control over spatial relations, attribute binding, and world-knowledge elements in the output image.
  • Reasoning cues become directly actionable during generation instead of remaining only in the model's internal encodings.
  • Performance improves on GenEval, WISE, and T2I-CompBench without emitting reasoning tokens or intermediate images.
  • Ablations confirm that the action trajectories are consumed inside the generation process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-action mechanism could be tested in video or 3D generation tasks where planning steps need to enforce temporal or geometric consistency.
  • If the action trajectories prove causally effective, similar hidden-action modules might improve controllability in other unified models that combine understanding and synthesis.
  • Explicit role-specific interventions on the trajectories could become a diagnostic tool for identifying which reasoning stage is failing in a given prompt.

Load-bearing premise

The learned latent action trajectories are actually consumed by the generator and causally affect the final image rather than being ignored or bypassed.

What would settle it

A direct intervention experiment that modifies specific dimensions of the learned latent action trajectory mid-generation and checks whether the resulting images change in ways predicted by the planning, drafting, or refinement roles.
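The logic of such a test can be sketched with toy stand-ins. The snippet below is not the paper's protocol: `toy_generator` and `bypassing_generator` are hypothetical functions, and the zero and random replacements simply mirror the conditions named in the Figure 2 caption. It shows the signal one would look for: a generator that consumes the actions changes its output under intervention, while one that bypasses them does not.

```python
import math
import random

def toy_generator(prompt_states, actions):
    """Hypothetical generator: mean-pools every conditioning vector and
    squashes the result. It is a toy, not the paper's model."""
    rows = prompt_states + actions
    dim = len(rows[0])
    return [math.tanh(sum(r[i] for r in rows) / len(rows)) for i in range(dim)]

def bypassing_generator(prompt_states, actions):
    """Hypothetical generator that ignores the latent actions entirely."""
    return toy_generator(prompt_states, [])

def output_shift(a, b):
    """Euclidean distance between two generator outputs."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def intervene(actions, mode, rng):
    """Replace the latent actions with zeros or with fresh Gaussian noise."""
    if mode == "zero":
        return [[0.0] * len(v) for v in actions]
    if mode == "random":
        return [[rng.gauss(0.0, 1.0) for _ in v] for v in actions]
    raise ValueError(f"unknown mode: {mode}")

rng = random.Random(0)
dim = 8
prompt_states = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(4)]
actions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(8)]

baseline = toy_generator(prompt_states, actions)
shifts = {
    mode: output_shift(baseline, toy_generator(prompt_states, intervene(actions, mode, rng)))
    for mode in ("zero", "random")
}
bypass_shift = output_shift(
    bypassing_generator(prompt_states, actions),
    bypassing_generator(prompt_states, intervene(actions, "zero", rng)),
)
print(shifts, bypass_shift)
```

A role-specific version would intervene on one role's slice of the trajectory at a time and compare the change against what that role is supposed to control.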

Figures

Figures reproduced from arXiv: 2605.16961 by Fuxiang Zhai, Jianyu Lai, Lei Zhu, Shuaibo Li, Sixiang Chen, Tengjun Huang, Yingjin Li.

Figure 1: Overview of LAC. The model rolls out a role-structured latent action trajectory with four hidden roles, conditions image synthesis on the resulting hidden states, and trains the hidden action interface with prior-guided latent action alignment followed by outcome-level Latent-Flow GRPO. Let Hθ(c, Z, <end>) denote the conditioning hidden states obtained after prompt encoding, latent action injection, and se…
Figure 2: Inference-time latent intervention. Zero and random latents test dependence on action…
Figure 3: GenEval drops under role removals and fixed latent budgets. We next examine which roles and computation budgets drive compositional control…
Original abstract

Unified multimodal models can encode visual understanding and image generation within a shared backbone, yet understanding does not automatically translate into control: models may infer objects, relations, or knowledge cues but fail to instantiate them in the generated image. We propose Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions inside a unified generator. Given a prompt, LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, without producing reasoning tokens or intermediate images. Since such action trajectories are unobserved, LAC learns them through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. This provides a control path from inferred relations, bindings, and knowledge cues to the generation process. Instantiated on BAGEL-7B-MoT, LAC consistently improves compositional and knowledge-grounded generation across GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts. Ablations and latent interventions show that the learned action trajectory is consumed by the generator, suggesting that unified generation benefits when understanding is not only encoded, but made actionable during generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Latent Action Control (LAC) to address the gap between visual understanding and controllable generation in unified multimodal models. LAC represents reasoning as unobserved hidden continuous latent action trajectories (for planning, internal drafting, diagnosis, and refinement) that are injected directly into the hidden conditioning stream of a flow-based generator. These trajectories are learned via prior-guided variational latent action alignment (using training-only rendered semantic priors, draft image features, and supervised halting signals) followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. Instantiated on BAGEL-7B-MoT, the method reports consistent gains on GenEval, WISE, and T2I-CompBench, with largest improvements on spatial relations, attribute binding, and world-knowledge prompts; ablations and latent interventions are cited to show that the trajectories are consumed by the generator.

Significance. If the central claim holds, LAC provides a concrete mechanism for making inferred relations, bindings, and knowledge cues actionable during generation without emitting reasoning tokens or intermediate images. This could meaningfully advance unified models by closing the understanding-to-control loop. The reported benchmark gains on compositional tasks supply empirical motivation, and the ablations offer partial grounding, though the work does not include machine-checked proofs, fully open reproducible code, or parameter-free derivations.

major comments (3)
  1. [Experiments / Latent Interventions] The latent intervention experiments (referenced in the abstract and experiments section) are load-bearing for the claim that action trajectories are causally consumed by the generator and drive the observed gains. However, the manuscript provides no description of intervention mechanics: how trajectories are altered, at which denoising steps the interventions occur, or what controls isolate the action path from the variational alignment and GRPO objectives. Without these details, gains could plausibly arise from the training objectives alone rather than from actionable control.
  2. [§3 (Method)] §3 (Method), the definitions of prior-guided variational latent action alignment and Latent-Flow GRPO: the manuscript does not supply the full equations or show that the learned trajectories possess independent grounding beyond the supervised priors and terminal feedback. This leaves open the possibility that the trajectories reduce to fitted quantities, undermining the claim of a distinct control path from relations to generation.
  3. [Experiments] Experiments section: reported benchmark improvements lack error bars, exact train/validation splits, and analysis of how post-hoc choices (e.g., latent action dimensionality or trajectory length) affect results. This weakens confidence that the gains on spatial relations and attribute binding are robust and attributable to LAC rather than implementation specifics.
minor comments (2)
  1. [Abstract / §1] Abstract and §1: the acronym 'GRPO' (Group Relative Policy Optimization) is used before any expansion; define all acronyms at first use.
  2. [Figures] Figure captions (e.g., those showing latent trajectories or intervention results): ensure axis labels, step indices, and intervention conditions are fully legible and self-contained.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions made where the manuscript is updated.

  1. Referee: [Experiments / Latent Interventions] The latent intervention experiments (referenced in the abstract and experiments section) are load-bearing for the claim that action trajectories are causally consumed by the generator and drive the observed gains. However, the manuscript provides no description of intervention mechanics: how trajectories are altered, at which denoising steps the interventions occur, or what controls isolate the action path from the variational alignment and GRPO objectives. Without these details, gains could plausibly arise from the training objectives alone rather than from actionable control.

    Authors: We agree that the original manuscript lacked sufficient detail on the intervention mechanics, which is necessary to substantiate the causal role of the action trajectories. In the revised version, we have added a dedicated subsection in the Experiments section describing the procedure: trajectories are altered either by adding Gaussian noise scaled to the learned latent variance or by substitution with samples drawn from the variational prior; interventions are performed at early denoising timesteps (corresponding to the planning and drafting phases); and control experiments include running interventions after freezing the GRPO component or ablating the variational alignment loss. These additions isolate the contribution of the latent control path from the training objectives alone. revision: yes

  2. Referee: [§3 (Method)] §3 (Method), the definitions of prior-guided variational latent action alignment and Latent-Flow GRPO: the manuscript does not supply the full equations or show that the learned trajectories possess independent grounding beyond the supervised priors and terminal feedback. This leaves open the possibility that the trajectories reduce to fitted quantities, undermining the claim of a distinct control path from relations to generation.

    Authors: We acknowledge that the initial submission omitted the complete equations for brevity. The revised §3 now includes the full mathematical formulation: the evidence lower bound for prior-guided variational latent action alignment (incorporating the training-only semantic priors, draft image features, and halting signals) and the Latent-Flow GRPO objective (the policy gradient term adapted to continuous latent trajectories and flow-based rollouts). We have also added a short analysis demonstrating that the trajectories carry information independent of the priors, shown via mutual information estimates and performance drops in ablations that remove the latent path while retaining the priors. revision: yes

  3. Referee: [Experiments] Experiments section: reported benchmark improvements lack error bars, exact train/validation splits, and analysis of how post-hoc choices (e.g., latent action dimensionality or trajectory length) affect results. This weakens confidence that the gains on spatial relations and attribute binding are robust and attributable to LAC rather than implementation specifics.

    Authors: We have revised the Experiments section to improve reporting rigor. Error bars (standard deviation across three random seeds) are now reported for all main results on GenEval, WISE, and T2I-CompBench. The train/validation splits are explicitly stated to follow the official benchmark partitions. We have also added a sensitivity analysis varying latent action dimensionality (64/128/256) and trajectory length (4/8/12), confirming that the gains on spatial relations and attribute binding remain stable and are not driven by particular post-hoc hyperparameter choices. revision: yes
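The reporting described in the third response can be organized as a small sweep. The sketch below is only a scaffold under stated assumptions: the grid values (64/128/256 and 4/8/12) and the use of three seeds come from the rebuttal, while the seed values, the `evaluate` placeholder, and the choice of the sample standard deviation are hypothetical. The numbers it prints are synthetic and are not benchmark results.

```python
import itertools
import random
import statistics

DIMS = (64, 128, 256)   # latent action dimensionalities listed in the rebuttal
LENGTHS = (4, 8, 12)    # trajectory lengths listed in the rebuttal
SEEDS = (0, 1, 2)       # three seeds as in the rebuttal; the seed values are assumed

def evaluate(dim, length, seed):
    """Placeholder scorer returning a synthetic number, NOT a benchmark result.
    Replace it with a real training-plus-evaluation run."""
    return random.Random(f"{dim}-{length}-{seed}").random()

def sweep():
    """Return (dim, length, mean, sample std) for every grid configuration."""
    rows = []
    for dim, length in itertools.product(DIMS, LENGTHS):
        scores = [evaluate(dim, length, seed) for seed in SEEDS]
        rows.append((dim, length, statistics.mean(scores), statistics.stdev(scores)))
    return rows

rows = sweep()
for dim, length, mean, std in rows:
    print(f"dim={dim:3d} len={length:2d} score={mean:.3f} ± {std:.3f}")
```

`statistics.stdev` divides by n - 1; with only three seeds the difference from the population form (`statistics.pstdev`) is noticeable, so a report should say which one it uses.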

Circularity Check

0 steps flagged

No significant circularity; LAC method is empirically grounded

The paper proposes Latent Action Control (LAC) as a training procedure that learns unobserved latent action trajectories via prior-guided variational alignment from semantic priors and draft features, followed by Latent-Flow GRPO using terminal visual feedback, then injects them into the generator's hidden conditioning stream. Claims of improved compositional and knowledge-grounded generation are supported by results on external benchmarks (GenEval, WISE, T2I-CompBench) plus ablations and interventions. No equations, derivations, or self-citations are presented that reduce any prediction or result to the inputs by construction. The approach introduces independent content through its novel latent trajectory representation and optimization objectives, evaluated against standard metrics rather than tautologically.
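For orientation, the two training stages named above build on objectives with well-known generic forms, written out below. This page does not reproduce the paper's equations, so these are textbook templates rather than the authors' formulation: the symbols $q_\phi$, $p_\theta$, $x$ (image), $r_i$, and $G$ are placeholder notation, with $c$ (prompt) and $Z$ (latent trajectory) borrowed from the Figure 1 caption. The paper's alignment stage additionally draws on semantic priors, draft image features, and halting signals, which are not modeled here.

```latex
% Generic evidence lower bound over an unobserved trajectory Z,
% given prompt c and image x (placeholder notation, not the paper's).
\log p_\theta(x \mid c) \;\ge\;
  \mathbb{E}_{q_\phi(Z \mid x, c)}\big[\log p_\theta(x \mid Z, c)\big]
  - \mathrm{KL}\big(q_\phi(Z \mid x, c) \,\|\, p_\theta(Z \mid c)\big)

% GRPO-style group-relative advantage for G rollouts of the same prompt
% with terminal rewards r_1, ..., r_G.
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
                 {\operatorname{std}(r_1, \dots, r_G)}
```

Both expressions depend only on the training signal. This is why the referee asks for separate evidence that the learned trajectories carry information beyond the priors and the terminal reward.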

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the existence of useful semantic priors and draft features that can supervise latent actions, plus the assumption that flow-based generation can consume continuous hidden actions without explicit tokens.

free parameters (1)
  • latent action dimensionality and trajectory length
    Chosen to represent planning, drafting, diagnosis, and refinement steps; no specific value given but required for the hidden trajectory.
axioms (1)
  • domain assumption: Reasoning cues can be represented as continuous hidden actions that condition generation without producing tokens or images.
    Stated in the description of how LAC injects actions into the hidden stream.
invented entities (1)
  • latent action trajectory (no independent evidence)
    purpose: To encode planning, internal visual drafting, diagnosis, and refinement as hidden continuous actions.
    New postulated structure introduced to bridge understanding and generation.

pith-pipeline@v0.9.0 · 5795 in / 1229 out tokens · 33292 ms · 2026-05-19T20:36:35.233712+00:00 · methodology


