
arxiv: 2512.07348 · v2 · submitted 2025-12-08 · 💻 cs.CV

MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition

Pith reviewed 2026-05-17 00:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-image composition, image dataset, controllable image generation, AI image synthesis, benchmark evaluation, model fine-tuning, identity consistency, data curation

The pith

A 150K dataset of multi-image compositions trains AI models to produce coherent outputs from arbitrary numbers of reference images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the shortage of training data for multi-image composition by building MICo-150K, a collection of 150,000 high-quality composite images. The authors organize the task into seven representative categories, generate balanced examples with proprietary models, and apply human filtering and refinement to ensure identity consistency. They add a Decomposition-and-Recomposition (De&Re) subset drawn from 11,000 real-world images so that the data covers both synthetic and real compositions. They also introduce a dedicated benchmark, MICo-Bench, and a metric, Weighted-Ref-VIEScore, to measure progress. Fine-tuning experiments show that the data gives the skill to models that lack it and strengthens models that already have some ability; the Qwen-MICo baseline matches Qwen-Image-2509 on three-image cases while accepting any number of inputs.

Core claim

The paper claims that MICo-150K, built by categorizing multi-image composition into seven tasks, synthesizing balanced composites with proprietary models, and refining them via human-in-the-loop processes, plus an 11K real-image De&Re subset, supplies the missing training resource. Models fine-tuned on it acquire or improve multi-image composition, as shown by Qwen-MICo matching Qwen-Image-2509 on three-image tasks while supporting arbitrary inputs.

What carries the argument

MICo-150K dataset with its seven-task synthesis pipeline, human refinement, and De&Re real-image subset, paired with MICo-Bench and the Weighted-Ref-VIEScore metric.

If this is right

  • Models without multi-image composition ability gain this capability after fine-tuning on MICo-150K.
  • Models that already possess some composition skill show measurable further gains.
  • The resulting baseline supports arbitrary numbers of input images rather than being restricted to a fixed count.
  • Standardized evaluation across seven tasks and 300 challenging real-derived cases becomes available through MICo-Bench.
  • The Weighted-Ref-VIEScore metric supplies a MICo-specific way to score identity preservation and overall quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset may support downstream uses such as photo editing tools that combine user-provided references into new scenes.
  • Extending the decomposition-recomposition method to video frames could help maintain consistency across time.
  • Testing the trained models on entirely user-supplied real photos without further adaptation would reveal practical limits.
  • Larger-scale versions of the same curation process might narrow remaining gaps to human-level composition performance.

Load-bearing premise

Images synthesized by proprietary models and refined by humans supply high-quality, identity-consistent training examples that generalize to real-world multi-image composition.

What would settle it

The test is to run the fine-tuned models on a fresh set of human-composed real photographs outside the synthetic distribution. A large drop in identity consistency or coherence relative to the MICo-Bench results would falsify the generalization claim.
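
The comparison described here can be sketched in a few lines. This is a minimal illustration and not the paper's protocol: the function names, the use of per-case scores as input, and the choice of Cohen's d as the effect size are all assumptions introduced for this sketch. The paper does not state what size of drop would count as "large".

```python
from statistics import mean, stdev


def relative_drop(bench_scores, fresh_scores):
    """Fractional drop of the fresh-set mean relative to the benchmark mean."""
    bench_mean = mean(bench_scores)
    if bench_mean == 0:
        raise ValueError("benchmark mean is zero; relative drop is undefined")
    return (bench_mean - mean(fresh_scores)) / bench_mean


def cohens_d(bench_scores, fresh_scores):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(bench_scores), len(fresh_scores)
    s1, s2 = stdev(bench_scores), stdev(fresh_scores)
    pooled = (((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)) ** 0.5
    if pooled == 0:
        raise ValueError("pooled standard deviation is zero")
    return (mean(bench_scores) - mean(fresh_scores)) / pooled


# Hypothetical per-case scores, for illustration only (not results from the paper).
bench = [8.0, 7.0, 9.0]
fresh = [4.0, 3.0, 5.0]
print(relative_drop(bench, fresh))  # 0.5
print(cohens_d(bench, fresh))       # 4.0
```

Reporting a standardized effect size next to the raw drop makes the result comparable across metrics with different scales.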

Figures

Figures reproduced from arXiv: 2512.07348 by Bairui Li, Hongyang Wei, Jinrui Zhang, Kai Cui, Kangrui Cen, Lei Zhang, Xinyu Wei, Zeqing Wang, Zhen Guo.

Figure 1: Previous MICo methods typically collect high-quality …
Figure 2: Construction pipeline of MICo-150K. (a) The data construction pipeline for the Human-Centric, Object-Centric, and HOI (Human–Object Interaction) tasks. (b) The pipeline for the De&Re (Decompose and Recompose) task
Figure 3: Visualization examples from the MICo-150K dataset.
Figure 4: High-quality multi-image composition datasets that are …
Figure 5: Traditional VIEScore requires inputting all source and generated images into the evaluator, which often leads to degraded …
Figure 6: Qwen-Image-2509 is trained on a massive-scale dataset …
Figure 7: The leftmost displays the source and reference images. The first row shows model outputs before fine-tuning, the second row …
Figure 8: Comparison of open-source models before and after MICo-150K training. Some source images were cropped or background …
Figure 9: Nano-Banana [18] produces more realistic images with stronger fidelity to the source inputs and a higher quality ceiling, but occasionally fails on certain cases. GPT-Image-1 [51] exhibits a more stylized, less photo-realistic look, yet remains highly stable and consistently yields semantically coherent results.
Figure 10: Human-face source images exhibit a Western-centric …
Figure 11: We refer to the metric without the weighting factor …
Figure 12: Similarity to the reference image does not influence how Weighted-Ref-VIEScore ranks model outputs, demonstrating that the …
Figure 13: Although PQ and SC scores are computed using GPT-4o, we find no evidence of evaluator–generator coupling: images that are …
Figure 14: Weighted-Ref-VIEScore effectively prevents copy–paste hacks. We segmented objects or persons from the source images and manually pasted them onto the scene image to form a naïve, unharmonized composite. Although such copy–paste results achieve a perfect weight factor (since every source element appears in the output), their PQ and SC scores remain very low
Figure 15: Qwen-MICo consistently outperforms Qwen-Image-2509 across nearly all evaluation dimensions on the MICo-Bench three …
Figure 16: Qwen-MICo exhibits strong emergent abilities in recognizing and composing complex …
Figure 17: Qwen-MICo performs well on virtual makeup try-on (transferring the makeup in Image 2 onto the girl in Image 1).
Figure 18: Qwen-MICo shows excellent performance on visually complex tasks that demand …
Figure 19: Qwen-MICo preserves the subject’s identity while accurately modeling …
Figure 20: Qwen-MICo preserves the entire appearance of Input 2 while correctly interpreting the prompt phrase …
Original abstract

In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data. To bridge this gap, we conduct a systematic study of MICo, categorizing it into 7 representative tasks and curate a large-scale collection of high-quality source images and construct diverse MICo prompts. Leveraging powerful proprietary models, we synthesize a rich amount of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency. We further build a Decomposition-and-Recomposition (De&Re) subset, where 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions. To enable comprehensive evaluation, we construct MICo-Bench with 100 cases per task and 300 challenging De&Re cases, and further introduce a new metric, Weighted-Ref-VIEScore, specifically tailored for MICo evaluation. Finally, we fine-tune multiple models on MICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K effectively equips models without MICo capability and further enhances those with existing skills. Notably, our baseline model, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs beyond the latter's limitation. Our dataset, benchmark, and baseline collectively offer valuable resources for further research on Multi-Image Composition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MICo-150K, a 150K-example dataset for multi-image composition (MICo) in controllable image generation. It categorizes MICo into 7 tasks, synthesizes balanced composites via proprietary models with human-in-the-loop filtering and refinement, and includes an 11K-example Decomposition-and-Recomposition (De&Re) subset derived from real-world images. The work also presents MICo-Bench (100 cases per task plus 300 De&Re cases) and the Weighted-Ref-VIEScore metric, then demonstrates via fine-tuning that models such as Qwen-MICo (from Qwen-Image-Edit) acquire or improve MICo skills, matching Qwen-Image-2509 on 3-image composition while supporting arbitrary numbers of inputs.

Significance. If the reported gains hold under rigorous scrutiny, the dataset and benchmark would address a clear data scarcity issue in identity-consistent multi-image synthesis, providing community resources that mix synthetic scale with real-image grounding via De&Re. The explicit support for arbitrary input counts beyond fixed baselines is a practical advance for controllable generation pipelines.

major comments (2)
  1. [§5 (Experiments)] The headline claim that Qwen-MICo matches Qwen-Image-2509 on 3-image composition while handling arbitrary inputs is load-bearing for the assertion that MICo-150K 'effectively equips models.' No numerical scores, standard deviations, or full baseline tables are referenced in the evaluation summary, preventing assessment of effect size or statistical reliability.
  2. [§3 (Dataset construction) and §5 (Evaluation)] Only 11K of the 150K examples come from real De&Re images; the remainder are proprietary-model synthetics after human filtering. No quantitative identity-consistency metrics (e.g., face-similarity scores or CLIP-based consistency) are reported on held-out real photographs, leaving the transfer assumption from synthetic to real multi-image scenarios untested and central to the generalization claim.
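
To make the second major comment concrete, here is one way a face-similarity consistency score could be computed. This is a sketch under stated assumptions and not the paper's evaluation: it assumes face embeddings have already been extracted by some recognition model (the embedding step is out of scope), and the function names `cosine` and `identity_consistency` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def identity_consistency(ref_embeddings, out_embeddings):
    """Mean, over reference identities, of the best cosine match among output faces.

    ref_embeddings: one embedding per reference identity (assumed precomputed).
    out_embeddings: one embedding per face detected in the generated image.
    """
    if not ref_embeddings:
        raise ValueError("at least one reference embedding is required")
    if not out_embeddings:
        return 0.0  # no face found in the output: no identity was preserved
    best = [max(cosine(r, o) for o in out_embeddings) for r in ref_embeddings]
    return sum(best) / len(best)


# Toy 2-D embeddings, for illustration only.
refs = [[1.0, 0.0], [0.0, 1.0]]
print(identity_consistency(refs, [[1.0, 0.0], [0.0, 1.0]]))  # 1.0
print(identity_consistency(refs, [[1.0, 0.0]]))              # 0.5
```

The greedy best-match step lets two references match the same output face. A one-to-one assignment between references and output faces would be stricter, at the cost of a little more code.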
minor comments (2)
  1. [§4 (Benchmark and Metric)] The formal definition and weighting scheme of the new Weighted-Ref-VIEScore metric should be stated explicitly with an equation or pseudocode to allow independent reproduction.
  2. [§3 (Dataset construction)] Filtering criteria and prompt templates used during human-in-the-loop refinement are described at high level; a supplementary table listing exact rejection rates or inter-annotator agreement would improve reproducibility.
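
The first minor comment asks for pseudocode, so a rough shape of such a metric is sketched here. The exact definition is not given on this page, so every detail is an assumption: the weight is taken to be the fraction of source elements judged present in the output (the Figure 14 caption only says the weight is perfect when every source element appears), the base score is taken to be the geometric mean of SC and PQ on a 0 to 10 scale, and the name `weighted_ref_viescore` is hypothetical.

```python
import math


def weighted_ref_viescore(sc, pq, elements_present):
    """Assumed form: (fraction of source elements present) * sqrt(SC * PQ).

    sc, pq: semantic-consistency and perceptual-quality scores, assumed in [0, 10].
    elements_present: one boolean per source element, True if it appears in the output.
    """
    if not elements_present:
        raise ValueError("at least one source element is required")
    if not (0 <= sc <= 10 and 0 <= pq <= 10):
        raise ValueError("sc and pq are assumed to lie in [0, 10]")
    weight = sum(1 for present in elements_present if present) / len(elements_present)
    base = math.sqrt(sc * pq)  # assumed geometric-mean aggregation
    return weight * base


# Illustrative inputs only (not scores from the paper).
print(weighted_ref_viescore(9.0, 4.0, [True, True, True]))  # 6.0
print(weighted_ref_viescore(9.0, 4.0, [True, False]))       # 3.0
print(weighted_ref_viescore(1.0, 1.0, [True, True]))        # 1.0
```

The last call mirrors the copy-paste case described in the Figure 14 caption: the weight is perfect, but low PQ and SC keep the final score low.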

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions have been made to strengthen the manuscript.

  1. Referee: [§5 (Experiments)] The headline claim that Qwen-MICo matches Qwen-Image-2509 on 3-image composition while handling arbitrary inputs is load-bearing for the assertion that MICo-150K 'effectively equips models.' No numerical scores, standard deviations, or full baseline tables are referenced in the evaluation summary, preventing assessment of effect size or statistical reliability.

    Authors: We agree that the evaluation summary in §5 should explicitly reference quantitative results to support the headline claim. The detailed comparisons, including scores on 3-image composition and arbitrary input counts, along with standard deviations, appear in the full experimental tables. We have revised §5 to include key numerical values from those tables and to direct readers to the complete baseline results for proper assessment of effect sizes. revision: yes

  2. Referee: [§3 (Dataset construction) and §5 (Evaluation)] Only 11K of the 150K examples come from real De&Re images; the remainder are proprietary-model synthetics after human filtering. No quantitative identity-consistency metrics (e.g., face-similarity scores or CLIP-based consistency) are reported on held-out real photographs, leaving the transfer assumption from synthetic to real multi-image scenarios untested and central to the generalization claim.

    Authors: The referee is correct that the majority of examples are synthetic and that explicit quantitative identity-consistency metrics on held-out real photographs were not reported. The De&Re subset provides real-image grounding, and human filtering was applied for quality control. To address the transfer assumption, we have added new quantitative evaluations (face-similarity and CLIP consistency) on held-out real images to the revised §5. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

The paper's core contribution is a data-curation pipeline that synthesizes composites via proprietary models, applies human-in-the-loop filtering, and constructs a separate benchmark (MICo-Bench) with real De&Re cases. Reported results consist of empirical fine-tuning outcomes on this dataset evaluated against the benchmark and external models (Qwen-Image-Edit, Qwen-Image-2509). No equations, predictions, or central claims reduce by construction to fitted parameters from the same data, no self-citations bear load-bearing weight on uniqueness or ansatzes, and no renaming of known results occurs. The chain remains a standard empirical dataset-plus-benchmark workflow that is self-contained against external model comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of proprietary-model synthesis plus human filtering to produce usable training data. No free parameters are introduced; the only notable assumption is the domain-level claim that the resulting composites are high-quality and identity-consistent.

axioms (1)
  • Domain assumption: Proprietary models can generate high-quality, identity-consistent composite images that become suitable training data after human-in-the-loop filtering and refinement.
    Invoked throughout the dataset construction process described in the abstract to justify the utility of MICo-150K.

pith-pipeline@v0.9.0 · 5622 in / 1454 out tokens · 150376 ms · 2026-05-17T00:16:11.119252+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

  2. Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro

    cs.CV 2026-04 unverdicted novelty 7.0

    Banana100 dataset shows that none of 21 popular NR-IQA metrics consistently rate images degraded by 100 iterative edits lower than clean originals.

  3. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

  4. LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 3 Pith papers · 2 internal anchors

  1. [1]

    Mc-llava: Multi-concept personalized vision-language model. arXiv preprint arXiv:2411.11706, 2024

    Ruichuan An, Sihan Yang, Ming Lu, Renrui Zhang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, et al. Mc-llava: Multi-concept personalized vision-language model. arXiv preprint arXiv:2411.11706, 2024. 12

  2. [2]

    Unictokens: Boosting personalized understanding and generation via unified concept tokens. arXiv preprint arXiv:2505.14671, 2025

    Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, et al. Unictokens: Boosting personalized understanding and generation via unified concept tokens. arXiv preprint arXiv:2505.14671, 2025. 12

  3. [3]

    Qwen2.5-vl technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report,...

  4. [4]

    headshot istockphoto: Headshot image dataset from istockphoto.https : / / huggingface

    BKM1804. headshot istockphoto: Headshot image dataset from istockphoto. https://huggingface.co/datasets/BKM1804/headshot_istockphoto,

  5. [5]

    Accessed: 2025-11-05, 6 000 images. 3, 10

  6. [6]

    headshot pexels v1: High-resolution headshot dataset from pexels.https : / / huggingface

    BKM1804. headshot pexels v1: High-resolution headshot dataset from pexels. https://huggingface.co/datasets/BKM1804/headshot_pexels_v1, 2025. Accessed: 2025-11-05, includes 3 000 images. 3, 10

  7. [7]

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions,

  8. [8]

    Hunyuanimage 3.0 technical report,

    Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, Tiankai Hang, Duojun Huang, Jie Jiang, Zhengkai Jiang, Weijie Kong, Changlin Li, Donghao Li, Junzhe Li, Xin Li, Yang Li, Zhenxi Li, Zhimin Li, Jiaxin Lin, Linus, Lucaz Liu, Shu Liu, Songtao Liu, Yu Liu, Yuhong Liu, Yanxin Long, Fan...

  9. [9]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021. 6, 10

  10. [10]

    Xverse: Consistent multi-subject control of identity and semantic attributes via dit modulation, 2025

    Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, and Xinglong Wu. Xverse: Consistent multi-subject control of identity and semantic attributes via dit modulation, 2025. 1, 3

  11. [11]

    Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation, 2024

    Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation, 2024. 1

  12. [12]

    Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,

  13. [13]

    Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset, 2025

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, and Ran Xu. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset, 2025. 2, 8, 11, 13

  14. [14]

    Blip3o-next: Next frontier of native image generation, 2025

    Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, and Ran Xu. Blip3o-next: Next frontier of native image generation, 2025. 8

  15. [15]

    Lamic: Layout-aware multi-image composition via scalability of multimodal diffusion transformer, 2025

    Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, and Weiming Zhang. Lamic: Layout-aware multi-image composition via scalability of multimodal diffusion transformer, 2025. 1, 3

  16. [16]

    Viton-hd: High-resolution virtual try-on via misalignment-aware normalization, 2021

    Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization, 2021. 3, 10

  17. [17]

    Idadapter: Learning mixed features for tuning-free personalization of text-to-image models

    Siying Cui, Jia Guo, Xiang An, Jiankang Deng, Yongle Zhao, Xinyu Wei, and Ziyong Feng. Idadapter: Learning mixed features for tuning-free personalization of text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 950–959, 2024. 1

  18. [18]

    Idadapter: Learning mixed features for tuning-free personalization of text-to-image models, 2024

    Siying Cui, Jia Guo, Xiang An, Jiankang Deng, Yongle Zhao, Xinyu Wei, and Ziyong Feng. Idadapter: Learning mixed features for tuning-free personalization of text-to-image models, 2024. 1

  19. [19]

    Introducing gemini 1.5 flash: Fast, efficient, and multimodal.https : / / developers

    Google DeepMind. Introducing gemini 1.5 flash: Fast, efficient, and multimodal. https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/, 2024. Accessed: 2025-10-18. 1, 2, 4, 5, 7, 10, 12, 13, 14

  20. [20]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 2, 8, 11, 13

  21. [21]

    Emerging properties in unified multimodal pretraining, 2025

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025. 1

  22. [22]

    Arcface: Additive angular margin loss for deep face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):5962–5979,

    Jiankang Deng, Jia Guo, Jing Yang, Niannan Xue, Irene Kotsia, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):5962–5979,

  23. [23]

    Scaling rectified flow transformers for high-resolution image synthesis, 2024

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. 1

  24. [24]

    A density-based algorithm for discovering clusters in large spatial databases with noise

    Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Knowledge Discovery and Data Mining, 1996. 3

  25. [25]

    Seedream 3.0 technical report, 2025

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Hu...

  26. [26]

    Seedream 2.0: A native chinese-english bilingual image generation foundation model, 2025

    Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Linjie Yang, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Huang. Seedream 2.0: A native chines...

  27. [27]

    Pulid: Pure and lightning id customization via contrastive alignment, 2024

    Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Peng Zhang, and Qian He. Pulid: Pure and lightning id customization via contrastive alignment, 2024. 1, 3

  28. [28]

    Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark. arXiv preprint arXiv:2510.26802, 2025

    Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, and Pheng-Ann Heng. Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark. arXiv preprint arXiv:2510.26802, 2025. 14

  29. [29]

    Thinking-while-generating: Interleaving textual reasoning throughout visual generation. arXiv preprint arXiv:2511.16671, 2025a

    Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, and Pheng-Ann Heng. Thinking-while-generating: Interleaving textual reasoning throughout visual generation. arXiv preprint arXiv:2511.16671, 2025. 1

  30. [30]

    Can we generate images with cot? let’s verify and reinforce image generation step by step,

    Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Rui Huang, Haoquan Zhang, Manyuan Zhang, Jiaming Liu, Shanghang Zhang, Peng Gao, et al. Can we generate images with cot? let’s verify and reinforce image generation step by step. arXiv preprint arXiv:2501.13926, 2025. 1

  31. [31]

    Ella: Equip diffusion models with llm for enhanced semantic alignment, 2024

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment, 2024. 14

  32. [32]

    Resolving multi-condition confusion for finetuning-free personalized image generation, 2024

    Qihan Huang, Siming Fu, Jinlong Liu, Hao Jiang, Yipeng Yu, and Jie Song. Resolving multi-condition confusion for finetuning-free personalized image generation, 2024. 1, 3

  33. [33]

    T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot. arXiv preprint arXiv:2505.00703, 2025

    Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hongsheng Li. T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot. arXiv preprint arXiv:2505.00703, 2025. 1

  34. [34]

    Segment anything, 2023

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023. 1, 2, 3

  35. [35]

    Viescore: Towards explainable metrics for conditional image synthesis evaluation, 2024

    Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation, 2024. 7

  36. [36]

    Harold W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics (NRL), 52, 1955. 6

  37. [37]

    Flux.1 – official inference repository for flux models.https : / / github

    Black Forest Labs. Flux.1 – official inference repository for flux models. https://github.com/black-forest-labs/flux, 2024. Accessed: 2025-10-18. 1

  38. [38]

    Flux.1 kontext: Flow matching for in-context image generation and editing in latent space,

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context i...

  39. [39]

    Photomaker: Customizing realistic human photos via stacked id embedding, 2023

    Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. Photomaker: Customizing realistic human photos via stacked id embedding, 2023. 1

  40. [40]

    Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xu...

  41. [41]

    Weifeng Lin, Ziheng Wu, Jiayu Chen, Jun Huang, and Lianwen Jin. Scale-aware modulation meet transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6015–6026, 2023. 12

  42. [42]

    Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want. arXiv preprint arXiv:2403.20271, 2024. 12

  43. [43]

    Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Huan Teng, Junlin Xie, Yu Qiao, Peng Gao, et al. Pixwizard: Versatile image-to-image visual assistant with open-language instructions. arXiv preprint arXiv:2409.15278, 2024. 1

  44. [44]

    Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recognize, explain, caption, and segment anything in images and videos. arXiv preprint arXiv:2506.05302, 2025. 12

  45. [45]

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2024. 1, 2, 3

  46. [46]

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, and Daxin Jiang. Step1x-edit: A practical framework for general image editing, 2025. 1

  47. [47]

    Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Jiaming Liu, and Shanghang Zhang. Llm as dataset analyst: Subpopulation structure discovery with large language model. In European Conference on Computer Vision, pages 235–252. Springer, 2024. 12

  48. [48]

    Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject-diffusion: open domain personalized text-to-image generation without test-time fine-tuning, 2024. 1, 3

  49. [49]

    Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models, 2023. 1

  50. [50]

    Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, Mengtian Li, Mingcong Liu, Yi Zhang, Shaojin Wu, Songtao Zhao, Jian Zhang, Qian He, and Xinglong Wu. Dreamo: A unified framework for image customization, 2025. 1, 3

  51. [51]

    OpenAI. Dall·e 3. https://openai.com/index/dall-e-3/, 2023. Accessed: 2025-10-14. 1

  52. [52]

    OpenAI. Gpt-4o image generation. https://openai.com/index/introducing-4o-image-generation/, 2025. Accessed: 2025-10-18. 1, 2, 10, 12, 13, 14

  53. [53]

    Dinov2: Learning robust visual features without supervision, 2024

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ...

  54. [54]

    Maitreya Patel, Tejas Gokhale, Chitta Baral, and Yezhou Yang. Conceptbed: Evaluating concept learning abilities of text-to-image diffusion models, 2024. 1

  55. [55]

    Maitreya Patel, Sangmin Jung, Chitta Baral, and Yezhou Yang. λ-eclipse: Multi-concept personalized text-to-image diffusion models by leveraging clip latent space, 2024. 1, 3

  56. [56]

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 1

  57. [57]

    Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, and Ran Xu. Unicontrol: A unified diffusion model for controllable visual generation in the wild, 2023. 1

  58. [58]

    QwenLM. Qwen-image-edit-2509. https://huggingface.co/Qwen/Qwen-Image-Edit-2509, 2025. Accessed: 2025-11-12. 3, 8, 15

  59. [59]

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos,

  60. [60]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. 1

  61. [61]

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023. 1

  62. [62]

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, and Kfir Aberman. Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models, 2024. 1

  63. [63]

    Seedream 4.0: Toward next-generation multimodal image generation, 2025

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wa...

  64. [64]

    Jerry Sima, Eric Cheng, William Fedus, Miles Brundage, Mark Chen, Iason Gabriel, Sandhini Agarwal, Lilian Weng, et al. Addendum to gpt-4o system card: Native image capabilities. https://www.semanticscholar.org/paper/0c9b799e0dde7dcbe42f8dc61b242a0106739eba,

  65. [65]

    Accessed: 2025-10-18. 1

  66. [66]

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,...

  67. [67]

    Wensong Song, Hong Jiang, Zongxing Yang, Ruijie Quan, and Yi Yang. Insert anything: Image insertion via in-context editing in dit, 2025. 1

  68. [68]

    Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer, 2025. 3, 10

  69. [69]

    OpenAI Team. Gpt-4o system card, 2024. 4, 6, 7, 13

  70. [70]

    Darian Tomašević, Fadi Boutros, Chenhao Lin, Naser Damer, Vitomir Štruc, and Peter Peer. Id-booth: Identity-consistent face generation with diffusion models. In 2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG), pages 1–10. IEEE, 2025. 1

  71. [71]

    Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, and Pheng-Ann Heng. Delving into rl for image generation with cot: A study on dpo vs. grpo. arXiv preprint arXiv:2505.17017, 2025. 12

  72. [72]

    Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, and Pheng-Ann Heng. Delving into rl for image generation with cot: A study on dpo vs. grpo, 2025. 1

  73. [73]

    Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense feature...

  74. [74]

    Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio Iacobacci, and Sarah Parisot. Mulan: A multi layer annotated dataset for controllable text-to-image generation, 2024. 3, 10

  75. [75]

    Guanqun Wang, Jiaming Liu, Chenxuan Li, Yuan Zhang, Junpeng Ma, Xinyu Wei, Kevin Zhang, Maurice Chong, Renrui Zhang, Yijiang Liu, and Shanghang Zhang. Cloud-device collaborative learning for multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12646–12655, 2024. 12

  76. [76]

    Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, and Shanghang Zhang. Mr-mllm: Mutual reinforcement of multimodal comprehension and vision perception. arXiv preprint arXiv:2406.15768, 2024. 12

  77. [77]

    Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie, Xiaokun Wang, Yichen Wei, Chuanxin Tang, Bo Zhu, Changshi Li, Hongyang Wei, Eric Li, Xuchen Song, Yang Liu, and Yahui Zhou. Skywork unipic: Unified autoregressive modeling for visual understanding and generation, 2025. 1

  78. [78]

    Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, and Jianchao Yang. Seededit 3.0: Fast and high-quality generative image editing,

  79. [79]

    Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds, 2024. 1

  80. [80]

    Xierui Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance, 2025. 1, 3

Showing first 80 references.