pith. machine review for the scientific record.

arxiv: 2604.24625 · v1 · submitted 2026-04-27 · 💻 cs.CV · cs.AI · cs.LG · cs.MM


Meta-CoT: Enhancing Granularity and Generalization in Image Editing


Pith reviewed 2026-05-08 04:26 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.MM
keywords image editing · chain of thought · meta-tasks · decomposition · generalization · multi-modal models · consistency reward

The pith

Meta-CoT decomposes any image edit into a task-target-understanding triplet plus five meta-tasks to raise both granularity and generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that unified multi-modal models improve image editing when their chain-of-thought process incorporates finer-grained understanding. It does so by introducing a two-level decomposition: first, any editing intention is cast as the triplet of task, target, and required understanding ability, which produces task-specific reasoning and forces the model to learn each element separately; second, all tasks are reduced to five fundamental meta-tasks whose combination is claimed to be sufficient for generalization. A consistency reward keeps the actual edit aligned with the generated reasoning steps. The resulting method reports a 15.8 percent gain across 21 tasks and effective transfer to unseen edits after training on only the small meta-task set.

Core claim

Meta-CoT performs a two-level decomposition of single-image editing. In the first level, any editing intention is represented as the triplet (task, target, required understanding ability); task-specific CoT is generated and editing operations are traversed across all targets. In the second level, editing tasks are further broken into five fundamental meta-tasks. Training on these five meta-tasks together with the other two triplet elements is stated to be sufficient for strong generalization to diverse unseen tasks. The CoT-Editing Consistency Reward is introduced to align editing behavior with the reasoning trace, producing an overall 15.8 percent improvement across 21 editing tasks.
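The two-level decomposition described above can be sketched in a few lines. The triplet fields follow the paper's description, but the meta-task names and the `decompose` heuristic are illustrative assumptions of this review: the provided text does not enumerate the five meta-tasks.

```python
from dataclasses import dataclass

# Level 1: any editing intention as a (task, target, required understanding
# ability) triplet, following the paper's decomposition.
@dataclass
class EditTriplet:
    task: str           # the editing operation requested
    target: str         # the image region or entity the edit applies to
    understanding: str  # the ability required, e.g. "object grounding"

# Level 2: a hypothetical meta-task vocabulary. The paper claims five
# fundamental meta-tasks; the excerpt does not name them, so these labels
# are stand-ins.
META_TASKS = ["add", "remove", "replace", "modify_attribute", "restyle"]

def decompose(triplets):
    """Map each triplet's task onto a meta-task label (toy keyword match)."""
    plan = []
    for t in triplets:
        label = next((m for m in META_TASKS if m.split("_")[0] in t.task.lower()),
                     "modify_attribute")  # fallback bucket for unmatched tasks
        plan.append((label, t.target, t.understanding))
    return plan

edits = [EditTriplet("replace the red car", "car", "object grounding"),
         EditTriplet("remove the watermark", "watermark", "text detection")]
print(decompose(edits))
```

Under this reading, generalization amounts to any new instruction reducing to a short plan over the fixed meta-task vocabulary plus the other two triplet elements.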

What carries the argument

The Meta-CoT two-level decomposition that casts editing as the triplet (task, target, required understanding ability) and reduces tasks to five meta-tasks, together with the CoT-Editing Consistency Reward that enforces alignment between reasoning and output edits.

If this is right

  • Training on the five meta-tasks together with the triplet elements is sufficient to generalize to diverse unseen editing tasks.
  • The decomposition guides the model to learn each element of the triplet separately, raising understanding granularity.
  • The consistency reward produces more accurate use of CoT information while performing the edit.
  • An overall 15.8 percent improvement is obtained across 21 editing tasks.
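The consistency reward in the bullets above can be read as a weighted-sum sketch; the scoring inputs and the weight `lam` are placeholders of this review, not the paper's formulation.

```python
def consistency_reward(edit_score: float, cot_alignment: float,
                       lam: float = 0.5) -> float:
    """Toy CoT-Editing Consistency Reward: base edit quality plus a term
    rewarding agreement between the edit and its reasoning trace.
    Both inputs are assumed to lie in [0, 1]."""
    assert 0.0 <= edit_score <= 1.0 and 0.0 <= cot_alignment <= 1.0
    return edit_score + lam * cot_alignment

# Two edits of equal raw quality: the one that follows its own reasoning
# trace receives the higher reward.
faithful = consistency_reward(edit_score=0.8, cot_alignment=0.9)
unfaithful = consistency_reward(edit_score=0.8, cot_alignment=0.2)
print(faithful > unfaithful)  # True
```

The point of such a term is that it penalizes an edit that is good in isolation but ignores the generated CoT, which is the failure mode the reward targets.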

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same decomposition pattern could be tested on other multi-modal generation tasks to reduce reliance on large task-specific datasets.
  • Identifying a small universal set of meta-operations may prove useful for efficient fine-tuning across additional generative domains.
  • The explicit triplet representation and reward may make model behavior more interpretable and easier to debug than standard CoT approaches.

Load-bearing premise

That every possible editing intention can be expressed as the specific triplet of task, target, and required understanding ability and that five meta-tasks are enough to generalize to all other edits.

What would settle it

A model trained only on the five meta-tasks performs no better than a baseline on a broad collection of complex unseen editing tasks that require different understanding abilities.

Figures

Figures reproduced from arXiv: 2604.24625 by Chunyu Wang, Qinglin Lu, Runze He, Shiyi Zhang, Tiankai Hang, Wenxun Dai, Yansong Tang, Yiji Cheng, Yunlong Lin, Yu Xu, Zijin Yin.

Figure 1: (a) We propose Meta-CoT with a hierarchical decomposition design and the CoT–Editing Consistency Reward. (b) 21-task … view at source ↗
Figure 2: Overview of Triplet Decomposition. Triplet Decomposition enables fine-grained reasoning over both the task and the target through three steps: (1) Task Summary, (2) Task Thinking, and (3) Target-wise Editing Traversal. This decomposition allows the model to optimize reasoning and editing along two elements: task and target. During training, we jointly incorporate diverse visual understanding tasks to captu… view at source ↗
Figure 3: Examples of editing intent decomposed into meta-tasks. view at source ↗
Figure 5: Meta-CoT Data Construction. This pipeline processes the source image, target image, and instruction into the Meta-CoT. … view at source ↗
Figure 6: Qualitative results across diverse editing tasks, including conventional editing, reasoning-based editing, and multi-instruction editing (Zoom in to view). We present a partial visualization of the Meta-CoT reasoning process. Meta-CoT can decompose instructions, categorize them into specific tasks, generate reasoning based on the task characteristics, and accurately determine whether each target should be … view at source ↗
read the original abstract

Unified multi-modal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: what forms of CoT and training strategy can jointly enhance both the understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet - (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing editing operations on all targets. This decomposition enhances the model's understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving the editing capability. (2) Generalizability. In the second decomposition level, we further break down editing tasks into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective utilization of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks, and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Meta-CoT, a paradigm for image editing in unified multi-modal models that performs a two-level decomposition of editing operations. Any single-image editing intention is represented as the triplet (task, target, required understanding ability); the first level decomposes the task and target to produce task-specific CoT while traversing operations on all targets, and the second level further reduces editing tasks to five fundamental meta-tasks. Training on these meta-tasks plus the triplet elements, combined with a new CoT-Editing Consistency Reward, is claimed to improve understanding granularity and yield strong generalization. Experiments report an overall 15.8% improvement across 21 editing tasks and effective generalization to unseen tasks when trained on only the small meta-task set.

Significance. If the empirical claims hold, the work would be significant for the field by offering a structured, meta-task-based training strategy that jointly targets finer-grained CoT reasoning and broader generalization in image editing, potentially reducing reliance on large task-specific datasets. The public release of code, benchmark, and model is a clear strength that supports reproducibility.

major comments (2)
  1. [Abstract] Generalizability paragraph: the central claim that 'training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks' is load-bearing yet lacks either a coverage argument or explicit out-of-distribution tests. If the 21 evaluated tasks are largely recombinations of the five meta-tasks, the reported generalization may not extend to edits requiring novel understanding abilities.
  2. [Experiments] The 15.8% overall improvement is presented without reference to the precise baselines, evaluation metrics, statistical significance tests, or train/test splits used for the 21 tasks. This information is required to assess whether the gains are robust and attributable to the proposed decomposition and reward.
minor comments (2)
  1. [Introduction] The triplet definition and its mapping to concrete editing examples could be illustrated earlier (e.g., in the introduction or method overview) to improve readability for readers unfamiliar with the decomposition.
  2. [Figures] Figure captions describing the two-level decomposition process should explicitly label the five meta-tasks and the CoT-Editing Consistency Reward to make the pipeline self-contained.
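The second major comment is concrete enough to illustrate: an "overall 15.8%" figure means different things depending on how per-task gains are aggregated. A toy sketch with invented scores (not the paper's data) shows macro- and micro-averaged relative improvement diverging on the same results:

```python
import statistics

# Hypothetical per-task scores for a baseline and the proposed method.
# The paper evaluates 21 tasks; these four tasks and all numbers are invented.
baseline = {"add": 0.50, "remove": 0.40, "replace": 0.60, "restyle": 0.20}
method   = {"add": 0.58, "remove": 0.45, "replace": 0.66, "restyle": 0.30}

# Macro: average the per-task relative gains (each task weighted equally).
macro = statistics.mean((method[t] - baseline[t]) / baseline[t] for t in baseline)

# Micro: relative gain of the pooled averages (tasks with larger score
# scales dominate).
micro = (statistics.mean(method.values()) - statistics.mean(baseline.values())) \
        / statistics.mean(baseline.values())

print(f"macro {macro:.1%}  micro {micro:.1%}")  # macro 22.1%  micro 17.1%
```

Without knowing which convention, baselines, and splits produced the 15.8% figure, the two readings cannot be distinguished, which is the substance of the referee's request.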

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and note the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] Generalizability paragraph: the central claim that 'training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks' is load-bearing yet lacks either a coverage argument or explicit out-of-distribution tests. If the 21 evaluated tasks are largely recombinations of the five meta-tasks, the reported generalization may not extend to edits requiring novel understanding abilities.

    Authors: We acknowledge the referee's point that the generalizability claim in the abstract would be strengthened by additional supporting arguments. The manuscript provides empirical support via experiments showing effective generalization to unseen tasks when trained only on the meta-task set. To directly address the concern, we will revise the abstract and the Generalizability section to include a concise coverage argument detailing how the five meta-tasks span the core understanding abilities, and we will clarify the out-of-distribution aspects of the 21 tasks (which include novel combinations beyond simple recombinations of meta-tasks). revision: partial

  2. Referee: [Experiments] The 15.8% overall improvement is presented without reference to the precise baselines, evaluation metrics, statistical significance tests, or train/test splits used for the 21 tasks. This information is required to assess whether the gains are robust and attributable to the proposed decomposition and reward.

    Authors: We agree that explicit details on baselines, metrics, statistical tests, and splits are necessary for proper assessment of the results. We will revise the Experiments section to prominently reference these elements alongside the 15.8% figure, including the specific baselines (standard unified multi-modal models), evaluation metrics, any significance testing performed, and the train/test splits for the 21 tasks. This will make the attribution to the proposed decomposition and reward clearer. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training paradigm with no derivation reducing to inputs

full rationale

The paper proposes Meta-CoT as an empirical method: it states an observation that editing intentions decompose into a (task, target, required understanding ability) triplet, breaks tasks into five meta-tasks, and reports that training on them plus the triplet elements yields generalization, validated by 15.8% gains on 21 tasks. No equations, fitted parameters presented as predictions, or first-principles derivations appear in the provided text. The claims rest on experimental results rather than on any chain that reduces by construction to its inputs or to self-citations. The approach is therefore self-contained as a training strategy.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented physical entities are present; the work rests on empirical observations about editing task structure and the sufficiency of five meta-tasks.

pith-pipeline@v0.9.0 · 5650 in / 1117 out tokens · 34424 ms · 2026-05-08T04:26:42.924134+00:00 · methodology


Reference graph

Works this paper leans on

87 extracted references · 54 canonical work pages · 23 internal anchors

  1. [1]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 4

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

  3. [3]

    In- structpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. In- structpix2pix: Learning to follow image editing instructions. InCVPR, pages 18392–18402, 2023. 6

  4. [4]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Sil- vio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025. 4

  5. [5]

    Blip3o-next: Next frontier of native image generation.arXiv preprint arXiv:2510.15857, 2025

    Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, et al. Blip3o-next: Next frontier of native image generation.arXiv preprint arXiv:2510.15857, 2025. 6

  6. [6]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Chengyue Wu, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus-pro: Unified mul- timodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025. 4

  7. [7]

    Chatumm: Robust context track- ing for conversational interleaved generation.arXiv preprint arXiv:2602.06442, 2026

    Wenxun Dai, Zhiyuan Zhao, Yule Zhong, Yiji Cheng, Jian- wei Zhang, Linqing Wang, Shiyi Zhang, Yunlong Lin, Runze He, Fellix Song, et al. Chatumm: Robust context track- ing for conversational interleaved generation.arXiv preprint arXiv:2602.06442, 2026. 4

  8. [8]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 2, 3, 4, 6

  9. [9]

    Dreamllm: Synergistic multimodal com- prehension and creation

    Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal com- prehension and creation. InICLR, 2024. 4

  10. [10]

    GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

    Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, and Xihui Liu. Got-r1: Unleashing reasoning capability of mllm for vi- sual generation with reinforcement learning.arXiv preprint arXiv:2505.17022, 2025. 4

  11. [11]

    Got: Unleashing reason- ing capability of multimodal large language model for visual generation and editing, 2025

    Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, et al. Got: Unleashing reasoning capability of multimodal large language model for visual generation and editing.arXiv preprint arXiv:2503.10639, 2025. 2, 4, 6

  12. [12]

    Refocus: Visual editing as a chain of thought for structured image understanding

    Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Cor- ring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Floren- cio, and Cha Zhang. Refocus: Visual editing as a chain of thought for structured image understanding.arXiv preprint arXiv:2501.05452, 2025. 4

  13. [13]

    Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale

    Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xi- ang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale.arXiv preprint arXiv:2412.05237,

  14. [14]

    Visual program- ming: Compositional visual reasoning without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual program- ming: Compositional visual reasoning without training. In CVPR, pages 14953–14962, 2023. 4

  15. [15]

    Controlthinker: Unveiling latent semantics for controllable image generation through visual reasoning.arXiv preprint arXiv:2506.03596, 2025

    Feng Han, Yang Jiao, Shaoxiang Chen, Junhao Xu, Jingjing Chen, and Yu-Gang Jiang. Controlthinker: Unveiling latent semantics for controllable image generation through visual reasoning.arXiv preprint arXiv:2506.03596, 2025. 4

  16. [16]

    Freeedit: Mask-free reference-based image editing with multi-modal instruction.IEEE Transactions on Pattern Anal- ysis and Machine Intelligence, 2025

    Runze He, Kai Ma, Linjiang Huang, Shaofei Huang, Jialin Gao, Xiaoming Wei, Jiao Dai, Jizhong Han, and Si Liu. Freeedit: Mask-free reference-based image editing with multi-modal instruction.IEEE Transactions on Pattern Anal- ysis and Machine Intelligence, 2025. 4

  17. [17]

    Re-align: Structured reasoning-guided alignment for in-context image generation and editing.arXiv preprint arXiv:2601.05124, 2026

    Runze He, Yiji Cheng, Tiankai Hang, Zhimin Li, Yu Xu, Zijin Yin, Shiyi Zhang, Wenxun Dai, Penghui Du, Ao Ma, et al. Re-align: Structured reasoning-guided alignment for in-context image generation and editing.arXiv preprint arXiv:2601.05124, 2026. 4

  18. [18]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.NeurIPS, 37:139348–139379,

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Osten- dorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.NeurIPS, 37:139348–139379,

  19. [19]

    Image editing as programs with diffusion models.arXiv preprint arXiv:2506.04158, 2025

    Yujia Hu, Songhua Liu, Zhenxiong Tan, Xingyi Yang, and Xinchao Wang. Image editing as programs with diffusion models.arXiv preprint arXiv:2506.04158, 2025. 6

  20. [20]

    Huang, S

    Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve.arXiv preprint arXiv:2210.11610,

  21. [21]

    Interleaving reasoning for better text-to-image generation.arXiv preprint arXiv:2509.06945, 2025

    Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, et al. Interleaving rea- soning for better text-to-image generation.arXiv preprint arXiv:2509.06945, 2025. 4

  22. [22]

    Ming-univision: Joint image under- standing and generation with a unified continuous tokenizer

    Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiao- long Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, et al. Ming-univision: Joint image under- standing and generation with a unified continuous tokenizer. arXiv preprint arXiv:2510.06590, 2025. 4, 6

  23. [23]

    T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot

    Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hong- sheng Li. T2i-r1: Reinforcing image generation with col- laborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025. 4

  24. [24]

    Ivan Krasin, Tom Duerig, Neil Alldrin, Andreas Veit, Sami Abu-El-Haija, Serge Belongie, David Cai, Zheyun Feng, Vittorio Ferrari, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, Kevin Murphy, Dhyanesh Narayanan, Saurabh Shetty, Yang Song, Joseph Tighe, Andrea Vedaldi, Sudheendra Vijayanarasimhan, and Oriol Vinyals. Open- images: A public dataset for l...

  25. [25]

    Viescore: Towards explainable metrics for conditional image synthesis evaluation

    Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation.arXiv preprint arXiv:2312.14867, 2023. 5, 6

  26. [26]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dock- horn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742,

  27. [27]

    Prometheus-vision: Vision-language model as a judge for fine-grained evaluation

    Seongyun Lee, Seungone Kim, Sue Park, Geewook Kim, and Minjoon Seo. Prometheus-vision: Vision-language model as a judge for fine-grained evaluation. InFindings of the association for computational linguistics ACL 2024, pages 11286–11315, 2024. 5

  28. [28]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6

  29. [29]

    Imagine while reasoning in space: Multimodal visualization-of-thought.arXiv preprint arXiv:2501.07542, 2025

    Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli´c, and Furu Wei. Imag- ine while reasoning in space: Multimodal visualization-of- thought.arXiv preprint arXiv:2501.07542, 2025. 4

  30. [30]

    Tooncomposer: Streamlining cartoon production with gen- erative post-keyframing.arXiv preprint arXiv:2508.10881,

    Lingen Li, Guangzhi Wang, Zhaoyang Zhang, Yaowei Li, Xiaoyu Li, Qi Dou, Jinwei Gu, Tianfan Xue, and Ying Shan. Tooncomposer: Streamlining cartoon production with gen- erative post-keyframing.arXiv preprint arXiv:2508.10881,

  31. [31]

    Nvcomposer: Boosting generative novel view synthesis with multiple sparse and unposed images

    Lingen Li, Zhaoyang Zhang, Yaowei Li, Jiale Xu, Wenbo Hu, Xiaoyu Li, Weihao Cheng, Jinwei Gu, Tianfan Xue, and Ying Shan. Nvcomposer: Boosting generative novel view synthesis with multiple sparse and unposed images. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 777–787, 2025. 4

  32. [32]

    arXiv preprint arXiv:2411.04996 , year =

    Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024. 4

  33. [33]

    Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

    Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025. 4

  34. [34]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld-v1: High-resolution seman- tic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025. 6

  35. [35]

    Jarvisart: Liberating human artistic creativity via an intelligent photo retouching agent

    Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding, Wenbo Li, et al. Jarvisart: Liberating human artistic creativ- ity via an intelligent photo retouching agent.arXiv preprint arXiv:2506.17612, 2025. 4

  36. [36]

    Jarvisevo: Towards a self-evolving photo editing agent with synergistic editor-evaluator optimization.arXiv preprint arXiv:2511.23002, 2025

    Yunlong Lin, Linqing Wang, Kunjie Lin, Zixu Lin, Kaixiong Gong, Wenbo Li, Bin Lin, Zhenxi Li, Shiyi Zhang, Yuyang Peng, et al. Jarvisevo: Towards a self-evolving photo editing agent with synergistic editor-evaluator optimization.arXiv preprint arXiv:2511.23002, 2025. 4

  37. [37]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via on- line rl.arXiv preprint arXiv:2505.05470, 2025. 3, 5

  38. [38]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chun- rui Han, et al. Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761, 2025. 3, 5, 6

  39. [39]

    Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action

    Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. InCVPR, pages 26439–26455, 2024. 4

  40. [40]

    Learn to explain: Multimodal reasoning via thought chains for science question answering.NeurIPS, 35: 2507–2521, 2022

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.NeurIPS, 35: 2507–2521, 2022. 4

  41. [41]

    Follow your pose: Pose- guided text-to-video generation using pose-free videos

    Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose- guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 4117–4125, 2024. 4

  42. [42]

    Janusflow: Harmonizing autoregres- sion and rectified flow for unified multimodal understanding and generation.arXiv preprint arXiv:2411.07975, 2024

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, and Chong Ruan. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation.arXiv preprint arXiv:2411.07975, 2024. 4

  43. [43]

    Fastvmt: Eliminat- ing redundancy in video motion transfer.arXiv preprint arXiv:2602.05551, 2026

    Yue Ma, Zhikai Wang, Tianhao Ren, Mingzhe Zheng, Hongyu Liu, Jiayi Guo, Mark Fong, Yuxuan Xue, Zixi- ang Zhao, Konrad Schindler, et al. Fastvmt: Eliminat- ing redundancy in video motion transfer.arXiv preprint arXiv:2602.05551, 2026. 4

  44. [44]

    Gpt-image-1, 2025

    OpenAI. Gpt-image-1, 2025. 6

  45. [45]

    Introducing 4o image generation, 2025

    OpenAI. Introducing 4o image generation, 2025. 4

  46. [46]

    Transfer between Modalities with MetaQueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Ji- uhai Chen, Kunpeng Li, Felix Juefei-Xu, Ji Hou, and Saining Xie. Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025. 4

  47. [47]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,

  48. [48]

    Du, Zehuan Yuan, and Xinglong Wu

    Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation.arXiv preprint arXiv:2412.03069, 2024. 4

  49. [49]

    Laion-5b: An open large-scale dataset for train- ing next generation image-text models.NeurIPS, 35:25278– 25294, 2022

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for train- ing next generation image-text models.NeurIPS, 35:25278– 25294, 2022. 6

  50. [50]

    Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning. NeurIPS, 37:8612–8642, 2024. 4

  51. [51]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathe- matical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 5

  52. [52]

    Llamafu- sion: Adapting pretrained language models for multimodal generation.arXiv preprint arXiv:2412.15188, 2024

    Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, and Lili Yu. Llamafu- sion: Adapting pretrained language models for multimodal generation.arXiv preprint arXiv:2412.15188, 2024. 4

    [53] Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. In ICLR, 2024.

    [54] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.

    [55] Henri Theil. On the estimation of relationships involving qualitative variables. American Journal of Sociology, 76(1):103–154, 1970.

    [56] Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024.

    [57] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

    [58] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.

    [59] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837, 2022.

    [60] Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In CVPR, pages 12966–12977, 2025.

    [61] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.

    [62] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025.

    [63] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In CVPR, pages 13084–13094, 2024.

    [64] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. In Forty-first ICML, 2024.

    [65] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. In CVPR, pages 13294–13304, 2025.

    [66] Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024.

    [67] Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025.

    [68] Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Xiaoyu Kong, Jintao Li, Oliver Deussen, and Tong-Yee Lee. Headrouter: A training-free image editing framework for mm-dits by adaptively routing attention heads. arXiv preprint arXiv:2411.15034, 2024.

    [69] Yu Xu, Hongbin Yan, Juan Cao, Yiji Cheng, Tiankai Hang, Runze He, Zijin Yin, Shiyi Zhang, Yuxin Zhang, Jintao Li, et al. Tag-moe: Task-aware gating for unified generative mixture-of-experts. arXiv preprint arXiv:2601.08881, 2026.

    [70] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen2.5 technical report.

    [71] Shiyuan Yang, Xiaodong Chen, and Jing Liao. Uni-paint: A unified framework for multimodal image inpainting with pretrained diffusion model. In Proceedings of the 31st ACM International Conference on Multimedia, pages 3190–3199, 2023.

    [72] Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, and Jing Liao. Direct-a-video: Customized video generation with user-directed camera movement and object motion. In ACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024.

    [73] Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, and Cihang Xie. Complex-edit: Cot-like instruction generation for complexity-controllable image editing benchmark. arXiv preprint arXiv:2504.13143, 2025.

    [74] Michihiro Yasunaga, Luke Zettlemoyer, and Marjan Ghazvininejad. Multimodal rewardbench: Holistic evaluation of reward models for vision language models. URL https://api.semanticscholar.org/CorpusID:276482127.

    [75] Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025.

    [76] Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. In CVPR, pages 26125–26135, 2025.

    [77] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. NeurIPS, 36:31428–31449, 2023.

    [78] Shiyi Zhang, Wenxun Dai, Sujia Wang, Xiangwei Shen, Jiwen Lu, Jie Zhou, and Yansong Tang. Logo: A long-form video dataset for group action quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2405–2414, 2023.

    [79] Shiyi Zhang, Sule Bai, Guangyi Chen, Lei Chen, Jiwen Lu, Junle Wang, and Yansong Tang. Narrative action evaluation with prompt-guided multimodal interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18430–18439, 2024.

    [80] Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, and Yansong Tang. Flexiact: Towards flexible action control in heterogeneous scenarios. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers, pages 1–11, 2025.
