Latent Action Control for Reasoning-Guided Unified Image Generation
Pith reviewed 2026-05-19 20:36 UTC · model grok-4.3
pith:EL2CHLBO Add to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{EL2CHLBO}
Prints a linked pith:EL2CHLBO badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
Latent Action Control turns inferred reasoning into hidden continuous actions that guide image generation inside unified models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Latent Action Control (LAC) makes reasoning actionable inside a unified generator by rolling out role-structured latent trajectories for planning, internal visual drafting, diagnosis, and refinement, then injecting these actions into the hidden stream that conditions flow-based image generation. The trajectories remain unobserved during inference and are instead learned through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. This supplies a direct control path from inferred relations, bindings, and
What carries the argument
Latent action trajectories: role-structured hidden continuous actions for planning, drafting, diagnosis, and refinement that are injected into the generator's conditioning stream without producing explicit tokens or images.
If this is right
- Unified generators achieve stronger control over spatial relations, attribute binding, and world-knowledge elements in the output image.
- Reasoning cues become directly actionable during generation instead of remaining only in the model's internal encodings.
- Performance improves on GenEval, WISE, and T2I-CompBench without requiring explicit reasoning outputs or extra inference steps.
- Ablations confirm that the action trajectories are consumed inside the generation process.
Where Pith is reading between the lines
- The same latent-action mechanism could be tested in video or 3D generation tasks where planning steps need to enforce temporal or geometric consistency.
- If the action trajectories prove causally effective, similar hidden-action modules might improve controllability in other unified models that combine understanding and synthesis.
- Explicit role-specific interventions on the trajectories could become a diagnostic tool for identifying which reasoning stage is failing in a given prompt.
Load-bearing premise
The learned latent action trajectories are actually consumed by the generator and causally affect the final image rather than being ignored or bypassed.
What would settle it
A direct intervention experiment that modifies specific dimensions of the learned latent action trajectory mid-generation and checks whether the resulting images change in ways predicted by the planning, drafting, or refinement roles.
Figures
read the original abstract
Unified multimodal models can encode visual understanding and image generation within a shared backbone, yet understanding does not automatically translate into control: models may infer objects, relations, or knowledge cues but fail to instantiate them in the generated image. We propose Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions inside a unified generator. Given a prompt, LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, without producing reasoning tokens or intermediate images. Since such action trajectories are unobserved, LAC learns them through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. This provides a control path from inferred relations, bindings, and knowledge cues to the generation process. Instantiated on BAGEL-7B-MoT, LAC consistently improves compositional and knowledge-grounded generation across GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts. Ablations and latent interventions show that the learned action trajectory is consumed by the generator, suggesting that unified generation benefits when understanding is not only encoded, but made actionable during generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Latent Action Control (LAC) to address the gap between visual understanding and controllable generation in unified multimodal models. LAC represents reasoning as unobserved hidden continuous latent action trajectories (for planning, internal drafting, diagnosis, and refinement) that are injected directly into the hidden conditioning stream of a flow-based generator. These trajectories are learned via prior-guided variational latent action alignment (using training-only rendered semantic priors, draft image features, and supervised halting signals) followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. Instantiated on BAGEL-7B-MoT, the method reports consistent gains on GenEval, WISE, and T2I-CompBench, with largest improvements on spatial relations, attribute binding, and world-knowledge prompts; ablations and latent interventions are cited to show that the trajectories are consumed by the generator.
Significance. If the central claim holds, LAC provides a concrete mechanism for making inferred relations, bindings, and knowledge cues actionable during generation without emitting reasoning tokens or intermediate images. This could meaningfully advance unified models by closing the understanding-to-control loop. The reported benchmark gains on compositional tasks supply empirical motivation, and the ablations offer partial grounding, though the work does not include machine-checked proofs, fully open reproducible code, or parameter-free derivations.
major comments (3)
- [Experiments / Latent Interventions] The latent intervention experiments (referenced in the abstract and experiments section) are load-bearing for the claim that action trajectories are causally consumed by the generator and drive the observed gains. However, the manuscript provides no description of intervention mechanics: how trajectories are altered, at which denoising steps the interventions occur, or what controls isolate the action path from the variational alignment and GRPO objectives. Without these details, gains could plausibly arise from the training objectives alone rather than from actionable control.
- [§3 (Method)] §3 (Method), the definitions of prior-guided variational latent action alignment and Latent-Flow GRPO: the manuscript does not supply the full equations or show that the learned trajectories possess independent grounding beyond the supervised priors and terminal feedback. This leaves open the possibility that the trajectories reduce to fitted quantities, undermining the claim of a distinct control path from relations to generation.
- [Experiments] Experiments section: reported benchmark improvements lack error bars, exact train/validation splits, and analysis of how post-hoc choices (e.g., latent action dimensionality or trajectory length) affect results. This weakens confidence that the gains on spatial relations and attribute binding are robust and attributable to LAC rather than implementation specifics.
minor comments (2)
- [Abstract / §1] Abstract and §1: the acronym 'GRPO' is used before any expansion; define all acronyms at first use.
- [Figures] Figure captions (e.g., those showing latent trajectories or intervention results): ensure axis labels, step indices, and intervention conditions are fully legible and self-contained.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions made where the manuscript is updated.
read point-by-point responses
-
Referee: [Experiments / Latent Interventions] The latent intervention experiments (referenced in the abstract and experiments section) are load-bearing for the claim that action trajectories are causally consumed by the generator and drive the observed gains. However, the manuscript provides no description of intervention mechanics: how trajectories are altered, at which denoising steps the interventions occur, or what controls isolate the action path from the variational alignment and GRPO objectives. Without these details, gains could plausibly arise from the training objectives alone rather than from actionable control.
Authors: We agree that the original manuscript lacked sufficient detail on the intervention mechanics, which is necessary to substantiate the causal role of the action trajectories. In the revised version, we have added a dedicated subsection in the Experiments section describing the procedure: trajectories are altered either by adding Gaussian noise scaled to the learned latent variance or by substitution with samples drawn from the variational prior; interventions are performed at early denoising timesteps (corresponding to the planning and drafting phases); and control experiments include running interventions after freezing the GRPO component or ablating the variational alignment loss. These additions isolate the contribution of the latent control path from the training objectives alone. revision: yes
-
Referee: [§3 (Method)] §3 (Method), the definitions of prior-guided variational latent action alignment and Latent-Flow GRPO: the manuscript does not supply the full equations or show that the learned trajectories possess independent grounding beyond the supervised priors and terminal feedback. This leaves open the possibility that the trajectories reduce to fitted quantities, undermining the claim of a distinct control path from relations to generation.
Authors: We acknowledge that the initial submission omitted the complete equations for brevity. The revised §3 now includes the full mathematical formulation: the evidence lower bound for prior-guided variational latent action alignment (incorporating the training-only semantic priors, draft image features, and halting signals) and the Latent-Flow GRPO objective (the policy gradient term adapted to continuous latent trajectories and flow-based rollouts). We have also added a short analysis demonstrating that the trajectories carry information independent of the priors, shown via mutual information estimates and performance drops in ablations that remove the latent path while retaining the priors. revision: yes
-
Referee: [Experiments] Experiments section: reported benchmark improvements lack error bars, exact train/validation splits, and analysis of how post-hoc choices (e.g., latent action dimensionality or trajectory length) affect results. This weakens confidence that the gains on spatial relations and attribute binding are robust and attributable to LAC rather than implementation specifics.
Authors: We have revised the Experiments section to improve reporting rigor. Error bars (standard deviation across three random seeds) are now reported for all main results on GenEval, WISE, and T2I-CompBench. The train/validation splits are explicitly stated to follow the official benchmark partitions. We have also added a sensitivity analysis varying latent action dimensionality (64/128/256) and trajectory length (4/8/12), confirming that the gains on spatial relations and attribute binding remain stable and are not driven by particular post-hoc hyperparameter choices. revision: yes
Circularity Check
No significant circularity; LAC method is empirically grounded
full rationale
The paper proposes Latent Action Control (LAC) as a training procedure that learns unobserved latent action trajectories via prior-guided variational alignment from semantic priors and draft features, followed by Latent-Flow GRPO using terminal visual feedback, then injects them into the generator's hidden conditioning stream. Claims of improved compositional and knowledge-grounded generation are supported by results on external benchmarks (GenEval, WISE, T2I-CompBench) plus ablations and interventions. No equations, derivations, or self-citations are presented that reduce any prediction or result to the inputs by construction. The approach introduces independent content through its novel latent trajectory representation and optimization objectives, evaluated against standard metrics rather than tautologically.
Axiom & Free-Parameter Ledger
free parameters (1)
- latent action dimensionality and trajectory length
axioms (1)
- domain assumption Reasoning cues can be represented as continuous hidden actions that condition generation without producing tokens or images.
invented entities (1)
-
latent action trajectory
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Improving image generation with better captions.Computer Science
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions.Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023
work page 2023
-
[2]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Flux.https://github.com/black-forest-labs/flux, 2024
Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024
work page 2024
-
[4]
Show, don’t tell: Morphing latent reasoning into image generation
Harold Haodong Chen, Xinxiang Yin, Wen-Jie Shu, Hongfei Zhang, Zixin Zhang, Chenfei Liao, Litao Guo, Qifeng Chen, and Ying-Cong Chen. Show, don’t tell: Morphing latent reasoning into image generation. arXiv preprint arXiv:2602.02227, 2026
-
[5]
Pixart- σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation
Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024
work page 2024
-
[6]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024
work page 2024
-
[9]
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mo- hammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023
work page 2023
-
[10]
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprint arXiv:2404.14396, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023
work page 2023
-
[12]
Thinking-while-generating: Interleaving textual reasoning throughout visual generation
Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, and Pheng-Ann Heng. Thinking-while-generating: Interleaving textual reasoning throughout visual generation. arXiv preprint arXiv:2511.16671, 2025
-
[13]
Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, et al. Unicorn: Towards self-improving unified multimodal models through self-generated supervision.arXiv preprint arXiv:2601.03193, 2026
-
[14]
Training Large Language Models to Reason in a Continuous Latent Space
Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space, 2024.URL https://arxiv. org/abs/2412.06769, 98, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a- pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023
work page 2023
-
[16]
Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2. 5: Three insights towards enhancing aesthetic quality in text-to-image generation.arXiv preprint arXiv:2402.17245, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, et al. Editthinker: Unlocking iterative reasoning for any image editor.arXiv preprint arXiv:2512.05965, 2025. 10
-
[19]
Yaqi Li, Peng Chen, Mingyang Han, Pi Bu, Haoxiang Shi, Runzhou Zhao, Yang Yao, Xuan Zhang, Jun Song, and Bo Zheng. Visual-cog: Stage-aware reinforcement learning with chain of guidance for text-to-image generation.arXiv preprint arXiv:2508.18032, 2025
-
[20]
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation. arXiv preprint arXiv:2505.05472, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
Imagegen-cot: Enhancing text-to-image in-context learning with chain-of-thought reasoning
Jiaqi Liao, Zhengyuan Yang, Linjie Li, Dianqi Li, Kevin Lin, Yu Cheng, and Lijuan Wang. Imagegen-cot: Enhancing text-to-image in-context learning with chain-of-thought reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17214–17223, 2025
work page 2025
-
[22]
Yunlong Lin, Linqing Wang, Kunjie Lin, Zixu Lin, Kaixiong Gong, Wenbo Li, Bin Lin, Zhenxi Li, Shiyi Zhang, Yuyang Peng, et al. Jarvisevo: Towards a self-evolving photo editing agent with synergistic editor-evaluator optimization.arXiv preprint arXiv:2511.23002, 2025
-
[23]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, et al. Unigrpo: Unified policy optimization for reasoning-driven visual generation. arXiv preprint arXiv:2603.23500, 2026
-
[25]
Yuanhuiyi Lyu, Chi Kit Wong, Chenfei Liao, Lutao Jiang, Xu Zheng, Zexin Lu, Linfeng Zhang, and Xuming Hu. Understanding-in-generation: Reinforcing generative capability of unified model via infusing understanding into generation.arXiv preprint arXiv:2509.18639, 2025
-
[26]
Yapeng Mi, Yanpeng Zhao, Hengli Li, Chenxi Li, Huimin Wu, Xiaojian Ma, Song-Chun Zhu, Ying Nian Wu, and Qing Li. Milr: Improving multimodal image generation via test-time latent reasoning.arXiv preprint arXiv:2509.22761, 2025
-
[27]
Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, and Li Yuan. Does understanding inform generation in unified multimodal models? from analysis to path forward.arXiv preprint arXiv:2511.20561, 2025
-
[28]
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
Transfer between Modalities with MetaQueries
Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Tokenflow: Unified image tokenizer for multimodal understanding and generation
Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2545–2555, 2025
work page 2025
-
[31]
Generation enhances understanding in unified multimodal models via multi-representation generation
Zihan Su, Hongyang Wei, Kangrui Cen, Yong Wang, Guanhua Chen, Chun Yuan, and Xiangxiang Chu. Generation enhances understanding in unified multimodal models via multi-representation generation. arXiv preprint arXiv:2601.21406, 2026
-
[32]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models, 2024.URL https://arxiv. org/abs/2405.09818, 9(8), 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, et al. Internvl-u: Democratizing unified multimodal models for understanding, reasoning, generation and editing.arXiv preprint arXiv:2603.09877, 2026
-
[34]
Fanmeng Wang, Haotian Liu, Guojiang Zhao, Hongteng Xu, and Zhifeng Gao. Regular: Variational latent reasoning guided by rendered chain-of-thought.arXiv preprint arXiv:2601.23184, 2026
-
[35]
Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, and Yisen Wang. Monet: Reasoning in latent visual space beyond images and language.arXiv preprint arXiv:2511.21395, 2025
-
[36]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. 11
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multi- modal chain-of-thought reward model through reinforcement fine-tuning.arXiv preprint arXiv:2505.03318, 2025
-
[38]
Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, and Zheng Wei. Render-of-thought: Rendering textual chain-of-thought as images for visual latent reasoning.arXiv preprint arXiv:2601.14750, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[39]
Janus: Decoupling visual encoding for unified multimodal understanding and generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025
work page 2025
-
[40]
Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, and Jacky Keung. Lavit: Aligning latent visual thoughts for multi-modal reasoning.arXiv preprint arXiv:2601.10129, 2026
-
[41]
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation.arXiv preprint arXiv:2409.04429, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[43]
Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:15903–15935, 2023
work page 2023
-
[45]
DanceGRPO: Unleashing GRPO on Visual Generation
Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Zixuan Ye, Quande Liu, Cong Wei, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhan Luo. Visual-aware cot: Achieving high-fidelity visual consistency in unified models.arXiv preprint arXiv:2512.19686, 2025
-
[47]
Multimodal Latent Reasoning via Hierarchical Visual Cues Injection
Yiming Zhang, Qiangyu Yan, Borui Jiang, and Kai Han. Multimodal latent reasoning via hierarchical visual cues injection.arXiv preprint arXiv:2602.05359, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[48]
Soft thinking: Unlocking the reasoning potential of llms in continuous concept space
Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous concept space. arXiv preprint arXiv:2505.15778, 2025
-
[49]
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model.arXiv preprint arXiv:2408.11039, 2024. 12
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.