MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition

Bairui Li; Hongyang Wei; Jinrui Zhang; Kai Cui; Kangrui Cen; Lei Zhang; Xinyu Wei; Zeqing Wang; Zhen Guo

arxiv: 2512.07348 · v2 · submitted 2025-12-08 · 💻 cs.CV

Pith reviewed 2026-05-17 00:16 UTC · model grok-4.3

keywords multi-image composition · image dataset · controllable image generation · AI image synthesis · benchmark evaluation · model fine-tuning · identity consistency · data curation

The pith

A 150K dataset of multi-image compositions trains AI models to produce coherent outputs from arbitrary numbers of reference images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the shortage of training data for multi-image composition by building MICo-150K, a collection of 150,000 high-quality composite images. The authors organize the task into seven representative categories, generate balanced examples through proprietary models, and apply human filtering and refinement to ensure identity consistency. They add a Decomposition-and-Recomposition subset drawn from 11,000 real-world images to mix synthetic and authentic cases. A dedicated benchmark and metric are introduced to measure progress. Fine-tuning experiments show that the data equips models lacking the skill and strengthens those that already possess some ability, with one baseline matching a limited competitor on three-image cases while handling any input count.

Core claim

The paper claims that MICo-150K, built by categorizing multi-image composition into seven tasks, synthesizing balanced composites with proprietary models, and refining them via human-in-the-loop processes, plus an 11K real-image De&Re subset, supplies the missing training resource. Models fine-tuned on it acquire or improve multi-image composition, as shown by Qwen-MICo matching Qwen-Image-2509 on three-image tasks while supporting arbitrary inputs.

What carries the argument

MICo-150K dataset with its seven-task synthesis pipeline, human refinement, and De&Re real-image subset, paired with MICo-Bench and the Weighted-Ref-VIEScore metric.

If this is right

Models without multi-image composition ability gain this capability after fine-tuning on MICo-150K.
Models that already possess some composition skill show measurable further gains.
The resulting baseline supports arbitrary numbers of input images rather than being restricted to a fixed count.
Standardized evaluation across seven tasks and 300 challenging real-derived cases becomes available through MICo-Bench.
The Weighted-Ref-VIEScore metric supplies a MICo-specific way to score identity preservation and overall quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset may support downstream uses such as photo editing tools that combine user-provided references into new scenes.
Extending the decomposition-recomposition method to video frames could help maintain consistency across time.
Testing the trained models on entirely user-supplied real photos without further adaptation would reveal practical limits.
Larger-scale versions of the same curation process might narrow remaining gaps to human-level composition performance.

Load-bearing premise

Images synthesized by proprietary models and refined by humans supply high-quality, identity-consistent training examples that generalize to real-world multi-image composition.

What would settle it

Testing the fine-tuned models on a fresh set of human-composed real photographs outside the synthetic distribution and observing a large drop in identity consistency or coherence compared with MICo-Bench results would falsify generalization.
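
One way to make "a large drop" concrete is to compare mean scores on the two sets. The sketch below is illustrative only: the function name `relative_drop`, the example numbers, and any threshold for what counts as "large" are assumptions, not values from the paper.

```python
from statistics import mean


def relative_drop(bench_scores, real_scores):
    """Fractional decrease of the mean score from benchmark to real photos.

    Both arguments are hypothetical per-case scores where higher is better.
    The result is positive when the real-photo mean is lower.
    """
    if not bench_scores or not real_scores:
        raise ValueError("both score lists must be non-empty")
    bench_mean = mean(bench_scores)
    if bench_mean <= 0:
        raise ValueError("benchmark mean must be positive")
    return (bench_mean - mean(real_scores)) / bench_mean


# Made-up numbers purely for illustration.
drop = relative_drop([0.8, 0.9, 0.7], [0.4, 0.5, 0.3])
print(f"relative drop: {drop:.2f}")
```

Reporting the drop as a fraction of the benchmark mean keeps the comparison independent of the score scale, although a full test would also need a variance estimate.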

Figures

Figures reproduced from arXiv: 2512.07348 by Bairui Li, Hongyang Wei, Jinrui Zhang, Kai Cui, Kangrui Cen, Lei Zhang, Xinyu Wei, Zeqing Wang, Zhen Guo.

**Figure 2.** Construction pipeline of MICo-150K. (a) The data construction pipeline for the Human-Centric, Object-Centric, and HOI (Human–Object Interaction) tasks. (b) The pipeline for the De&Re (Decompose and Recompose) task [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Visualization examples from the MICo-150K dataset. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

**Figure 4.** High-quality multi-image composition datasets that are… [PITH_FULL_IMAGE:figures/full_fig_p006_4.png]

**Figure 5.** Traditional VIEScore requires inputting all source and generated images into the evaluator, which often leads to degraded… [PITH_FULL_IMAGE:figures/full_fig_p007_5.png]

**Figure 6.** Qwen-Image-2509 is trained on a massive-scale dataset… [PITH_FULL_IMAGE:figures/full_fig_p008_6.png]

**Figure 7.** The leftmost displays the source and reference images. The first row shows model outputs before fine-tuning, the second row… [PITH_FULL_IMAGE:figures/full_fig_p009_7.png]

**Figure 8.** Comparison of open-source models before and after MICo-150K training. Some source images were cropped or background… [PITH_FULL_IMAGE:figures/full_fig_p011_8.png]

**Figure 9.** Nano-Banana [18] produces more realistic images with stronger fidelity to the source inputs and a higher quality ceiling, but occasionally fails on certain cases. GPT-Image-1 [51] exhibits a more stylized, less photo-realistic look, yet remains highly stable and consistently yields semantically coherent results.

**Figure 10.** Human-face source images exhibit a Western-centric… [PITH_FULL_IMAGE:figures/full_fig_p011_10.png]

**Figure 11.** We refer to the metric without the weighting factor… [PITH_FULL_IMAGE:figures/full_fig_p013_11.png]

**Figure 12.** Similarity to the reference image does not influence how Weighted-Ref-VIEScore ranks model outputs, demonstrating that the… [PITH_FULL_IMAGE:figures/full_fig_p014_12.png]

**Figure 13.** Although PQ and SC scores are computed using GPT-4o, we find no evidence of evaluator–generator coupling: images that are… [PITH_FULL_IMAGE:figures/full_fig_p015_13.png]

**Figure 14.** Weighted-Ref-VIEScore effectively prevents copy–paste hacks. We segmented objects or persons from the source images and manually pasted them onto the scene image to form a naïve, unharmonized composite. Although such copy–paste results achieve a perfect weight factor (since every source element appears in the output), their PQ and SC scores remain very low [PITH_FULL_IMAGE:figures/full_fig_p016_14.png]

**Figure 15.** Qwen-MICo consistently outperforms Qwen-Image-2509 across nearly all evaluation dimensions on the MICo-Bench three… [PITH_FULL_IMAGE:figures/full_fig_p016_15.png]

**Figure 16.** Qwen-MICo exhibits strong emergent abilities in recognizing and composing complex… [PITH_FULL_IMAGE:figures/full_fig_p017_16.png]

**Figure 17.** Qwen-MICo performs well on virtual makeup try-on (transferring the makeup in Image 2 onto the girl in Image 1). [PITH_FULL_IMAGE:figures/full_fig_p018_17.png]

**Figure 18.** Qwen-MICo shows excellent performance on visually complex tasks that demand… [PITH_FULL_IMAGE:figures/full_fig_p019_18.png]

**Figure 19.** Qwen-MICo preserves the subject’s identity while accurately modeling… [PITH_FULL_IMAGE:figures/full_fig_p019_19.png]

**Figure 20.** Qwen-MICo preserves the entire appearance of Input 2 while correctly interpreting the prompt phrase… [PITH_FULL_IMAGE:figures/full_fig_p020_20.png]

Original abstract

In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data. To bridge this gap, we conduct a systematic study of MICo, categorizing it into 7 representative tasks and curate a large-scale collection of high-quality source images and construct diverse MICo prompts. Leveraging powerful proprietary models, we synthesize a rich amount of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency. We further build a Decomposition-and-Recomposition (De&Re) subset, where 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions. To enable comprehensive evaluation, we construct MICo-Bench with 100 cases per task and 300 challenging De&Re cases, and further introduce a new metric, Weighted-Ref-VIEScore, specifically tailored for MICo evaluation. Finally, we fine-tune multiple models on MICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K effectively equips models without MICo capability and further enhances those with existing skills. Notably, our baseline model, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs beyond the latter's limitation. Our dataset, benchmark, and baseline collectively offer valuable resources for further research on Multi-Image Composition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MICo-150K supplies a new dataset and benchmark for multi-image composition with a sensible real-image subset, but the model gains rest on thin visible evidence and untested transfer from synthetics.

The paper's main deliverable is MICo-150K, a 150K-example collection for multi-image composition organized around seven tasks, along with the 11K-example De&Re real-image subset, MICo-Bench, and the Weighted-Ref-VIEScore metric. The authors generate most examples with proprietary models, apply human filtering, fine-tune Qwen-Image-Edit, and report that the resulting model, Qwen-MICo, matches Qwen-Image-2509 on three-image cases while accepting an arbitrary number of inputs.

Referee Report

2 major / 2 minor

Summary. The paper introduces MICo-150K, a 150K-example dataset for multi-image composition (MICo) in controllable image generation. It categorizes MICo into 7 tasks, synthesizes balanced composites via proprietary models with human-in-the-loop filtering and refinement, and includes an 11K-example Decomposition-and-Recomposition (De&Re) subset derived from real-world images. The work also presents MICo-Bench (100 cases per task plus 300 De&Re cases) and the Weighted-Ref-VIEScore metric, then demonstrates via fine-tuning that models such as Qwen-MICo (from Qwen-Image-Edit) acquire or improve MICo skills, matching Qwen-Image-2509 on 3-image composition while supporting arbitrary numbers of inputs.

Significance. If the reported gains hold under rigorous scrutiny, the dataset and benchmark would address a clear data scarcity issue in identity-consistent multi-image synthesis, providing community resources that mix synthetic scale with real-image grounding via De&Re. The explicit support for arbitrary input counts beyond fixed baselines is a practical advance for controllable generation pipelines.

major comments (2)

[§5 (Experiments)] The headline claim that Qwen-MICo matches Qwen-Image-2509 on 3-image composition while handling arbitrary inputs is load-bearing for the assertion that MICo-150K 'effectively equips models.' No numerical scores, standard deviations, or full baseline tables are referenced in the evaluation summary, preventing assessment of effect size or statistical reliability.
[§3 (Dataset construction) and §5 (Evaluation)] Only 11K of the 150K examples come from real De&Re images; the remainder are proprietary-model synthetics after human filtering. No quantitative identity-consistency metrics (e.g., face-similarity scores or CLIP-based consistency) are reported on held-out real photographs, leaving the transfer assumption from synthetic to real multi-image scenarios untested and central to the generalization claim.
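
To make the second request concrete, an identity-consistency score of the kind mentioned above could be computed as the cosine similarity between an embedding of each reference subject and an embedding of the same subject in the generated image, reported as mean and standard deviation. The sketch below is a minimal illustration under stated assumptions: the embeddings are taken as given (for example from a face-recognition or CLIP encoder), and the function names are hypothetical, not part of the paper's evaluation code.

```python
import math
from statistics import mean, stdev


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_consistency(ref_embeddings, out_embeddings):
    """Mean and sample standard deviation of per-pair cosine similarity.

    ref_embeddings[i] and out_embeddings[i] are assumed to describe the
    same subject in the reference image and in the generated image.
    """
    if not ref_embeddings or len(ref_embeddings) != len(out_embeddings):
        raise ValueError("need matching, non-empty embedding lists")
    sims = [cosine_similarity(r, o)
            for r, o in zip(ref_embeddings, out_embeddings)]
    spread = stdev(sims) if len(sims) > 1 else 0.0
    return mean(sims), spread
```

Reporting the spread alongside the mean also speaks to the first major comment, since it gives readers a basis for judging effect size.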

minor comments (2)

[§4 (Benchmark and Metric)] The formal definition and weighting scheme of the new Weighted-Ref-VIEScore metric should be stated explicitly with an equation or pseudocode to allow independent reproduction.
[§3 (Dataset construction)] Filtering criteria and prompt templates used during human-in-the-loop refinement are described at high level; a supplementary table listing exact rejection rates or inter-annotator agreement would improve reproducibility.
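
As an indication of the level of detail the first minor comment asks for, the sketch below shows one possible shape for such a metric. It is not the paper's definition: it assumes that the weight factor is the fraction of source elements found in the output and that it multiplies the geometric mean of the perceptual-quality (PQ) and semantic-consistency (SC) scores. The function name and the score scale are likewise hypothetical.

```python
import math


def weighted_ref_score(present, pq, sc, scale=10.0):
    """Hypothetical weighted score: presence weight times sqrt(PQ * SC).

    present: one boolean per source element, True if it appears in the output.
    pq, sc:  evaluator scores on an assumed 0..scale range.
    """
    if not present:
        raise ValueError("need at least one source element")
    if not (0.0 <= pq <= scale and 0.0 <= sc <= scale):
        raise ValueError("scores must lie within [0, scale]")
    weight = sum(present) / len(present)  # assumed form of the weight factor
    return weight * math.sqrt(pq * sc)


# A copy-paste composite keeps every element (weight 1.0) but scores low on
# PQ and SC; a harmonized output that misses one element can still rank higher.
copy_paste = weighted_ref_score([True, True, True], pq=1.0, sc=2.0)
harmonized = weighted_ref_score([True, True, False], pq=8.0, sc=8.0)
print(copy_paste < harmonized)
```

Under these assumptions a perfect weight factor cannot rescue low PQ and SC scores, which is the qualitative behaviour the caption of Figure 14 describes; the actual weighting scheme may differ.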

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions have been made to strengthen the manuscript.

Referee: [§5 (Experiments)] The headline claim that Qwen-MICo matches Qwen-Image-2509 on 3-image composition while handling arbitrary inputs is load-bearing for the assertion that MICo-150K 'effectively equips models.' No numerical scores, standard deviations, or full baseline tables are referenced in the evaluation summary, preventing assessment of effect size or statistical reliability.

Authors: We agree that the evaluation summary in §5 should explicitly reference quantitative results to support the headline claim. The detailed comparisons, including scores on 3-image composition and arbitrary input counts, along with standard deviations, appear in the full experimental tables. We have revised §5 to include key numerical values from those tables and to direct readers to the complete baseline results for proper assessment of effect sizes.
revision: yes

Referee: [§3 (Dataset construction) and §5 (Evaluation)] Only 11K of the 150K examples come from real De&Re images; the remainder are proprietary-model synthetics after human filtering. No quantitative identity-consistency metrics (e.g., face-similarity scores or CLIP-based consistency) are reported on held-out real photographs, leaving the transfer assumption from synthetic to real multi-image scenarios untested and central to the generalization claim.

Authors: The referee is correct that the majority of examples are synthetic and that explicit quantitative identity-consistency metrics on held-out real photographs were not reported. The De&Re subset provides real-image grounding, and human filtering was applied for quality control. To address the transfer assumption, we have added new quantitative evaluations (face-similarity and CLIP consistency) on held-out real images to the revised §5.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

The paper's core contribution is a data-curation pipeline that synthesizes composites via proprietary models, applies human-in-the-loop filtering, and constructs a separate benchmark (MICo-Bench) with real De&Re cases. Reported results consist of empirical fine-tuning outcomes on this dataset evaluated against the benchmark and external models (Qwen-Image-Edit, Qwen-Image-2509). No equations, predictions, or central claims reduce by construction to fitted parameters from the same data, no self-citations bear load-bearing weight on uniqueness or ansatzes, and no renaming of known results occurs. The chain remains a standard empirical dataset-plus-benchmark workflow that is self-contained against external model comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of proprietary-model synthesis plus human filtering to produce usable training data. No free parameters are introduced; the only notable assumption is the domain-level claim that the resulting composites are high-quality and identity-consistent.

axioms (1)

Domain assumption: Proprietary models can generate high-quality, identity-consistent composite images that become suitable training data after human-in-the-loop filtering and refinement.
Invoked throughout the dataset construction process described in the abstract to justify the utility of MICo-150K.

pith-pipeline@v0.9.0 · 5622 in / 1454 out tokens · 150376 ms · 2026-05-17T00:16:11.119252+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Leveraging powerful proprietary models, we synthesize a rich amount of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "We further build a Decomposition-and-Recomposition (De&Re) subset, where 11K real-world complex images are decomposed into components and recomposed"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
cs.CV · 2026-05 · unverdicted · novelty 7.0
UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro
cs.CV · 2026-04 · unverdicted · novelty 7.0
Banana100 dataset shows that none of 21 popular NR-IQA metrics consistently rate images degraded by 100 iterative edits lower than clean originals.

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
cs.CV · 2026-05 · unverdicted · novelty 6.0
A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
cs.CV · 2026-04 · unverdicted · novelty 6.0
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 3 Pith papers · 2 internal anchors

[1] Ruichuan An, Sihan Yang, Ming Lu, Renrui Zhang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, et al. Mc-llava: Multi-concept personalized vision-language model. arXiv preprint arXiv:2411.11706, 2024.
[2] Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, et al. Unictokens: Boosting personalized understanding and generation via unified concept tokens. arXiv preprint arXiv:2505.14671, 2025.
[3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025.
[4] BKM1804. headshot istockphoto: Headshot image dataset from istockphoto. https://huggingface.co/datasets/BKM1804/headshot_istockphoto,
[5] Accessed: 2025-11-05, 6 000 images.
[6] BKM1804. headshot pexels v1: High-resolution headshot dataset from pexels. https://huggingface.co/datasets/BKM1804/headshot_pexels_v1, 2025. Accessed: 2025-11-05, includes 3 000 images.
[7] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions,
[8] Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, Tiankai Hang, Duojun Huang, Jie Jiang, Zhengkai Jiang, Weijie Kong, Changlin Li, Donghao Li, Junzhe Li, Xin Li, Yang Li, Zhenxi Li, Zhimin Li, Jiaxin Lin, Linus, Lucaz Liu, Shu Liu, Songtao Liu, Yu Liu, Yuhong Liu, Yanxin Long, Fan... Hunyuanimage 3.0 technical report,
[9] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021.
[10] Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, and Xinglong Wu. Xverse: Consistent multi-subject control of identity and semantic attributes via dit modulation, 2025.
[11] Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation, 2024.
[12] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,
[13] Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, and Ran Xu. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset, 2025.
[14] Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, and Ran Xu. Blip3o-next: Next frontier of native image generation, 2025.
[15] Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, and Weiming Zhang. Lamic: Layout-aware multi-image composition via scalability of multimodal diffusion transformer, 2025.
[16] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization, 2021.
[17] Siying Cui, Jia Guo, Xiang An, Jiankang Deng, Yongle Zhao, Xinyu Wei, and Ziyong Feng. Idadapter: Learning mixed features for tuning-free personalization of text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 950–959, 2024.
[18] Siying Cui, Jia Guo, Xiang An, Jiankang Deng, Yongle Zhao, Xinyu Wei, and Ziyong Feng. Idadapter: Learning mixed features for tuning-free personalization of text-to-image models, 2024.
[19] Google DeepMind. Introducing gemini 1.5 flash: Fast, efficient, and multimodal. https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/, 2024. Accessed: 2025-10-18.
[20] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.
[21] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025.
[22] Jiankang Deng, Jia Guo, Jing Yang, Niannan Xue, Irene Kotsia, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):5962–5979,
[23] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024.
[24] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Knowledge Discovery and Data Mining, 1996.
[25] Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Hu... Seedream 3.0 technical report, 2025.
[26] Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Linjie Yang, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Huang. Seedream 2.0: A native chinese-english bilingual image generation foundation model, 2025.
[27] Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Peng Zhang, and Qian He. Pulid: Pure and lightning id customization via contrastive alignment, 2024.
[28] Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, and Pheng-Ann Heng. Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark. arXiv preprint arXiv:2510.26802, 2025.
[29] Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, and Pheng-Ann Heng. Thinking-while-generating: Interleaving textual reasoning throughout visual generation. arXiv preprint arXiv:2511.16671, 2025.
[30] Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Rui Huang, Haoquan Zhang, Manyuan Zhang, Jiaming Liu, Shanghang Zhang, Peng Gao, et al. Can we generate images with cot? let’s verify and reinforce image generation step by step. arXiv preprint arXiv:2501.13926, 2025.
[31]

Ella: Equip diffusion models with llm for en- hanced semantic alignment, 2024

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for en- hanced semantic alignment, 2024. 14

work page 2024
[32]

Resolving multi-condition confusion for finetuning-free personalized image generation, 2024

Qihan Huang, Siming Fu, Jinlong Liu, Hao Jiang, Yipeng Yu, and Jie Song. Resolving multi-condition confusion for finetuning-free personalized image generation, 2024. 1, 3

work page 2024
[33]

T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025

Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hong- sheng Li. T2i-r1: Reinforcing image generation with col- laborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025. 1

work page arXiv 2025
[34]

Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C. Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything, 2023. 1, 2, 3

work page 2023
[35]

Viescore: Towards explainable metrics for conditional image synthesis evaluation, 2024

Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation, 2024. 7

work page 2024
[36] Harold W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics (NRL), 52, 1955. 6
[37] Black Forest Labs. Flux.1 – official inference repository for flux models. https://github.com/black-forest-labs/flux, 2024. Accessed: 2025-10-18. 1
[38] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space, ...
[39] Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. Photomaker: Customizing realistic human photos via stacked id embedding, 2023. 1
[40] Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xu... Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024.
[41] Weifeng Lin, Ziheng Wu, Jiayu Chen, Jun Huang, and Lianwen Jin. Scale-aware modulation meet transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6015–6026, 2023. 12
[42] Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want. arXiv preprint arXiv:2403.20271, 2024. 12
[43] Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Huan Teng, Junlin Xie, Yu Qiao, Peng Gao, et al. Pixwizard: Versatile image-to-image visual assistant with open-language instructions. arXiv preprint arXiv:2409.15278, 2024. 1
[44] Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recognize, explain, caption, and segment anything in images and videos. arXiv preprint arXiv:2506.05302, 2025. 12
[45] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2024. 1, 2, 3
[46] Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, and Daxin Jiang. Step1x-edit: A practical framework for general image editing, 2025. 1
[47] Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Jiaming Liu, and Shanghang Zhang. Llm as dataset analyst: Subpopulation structure discovery with large language model. In European Conference on Computer Vision, pages 235–252. Springer, 2024. 12
[48] Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject-diffusion: open domain personalized text-to-image generation without test-time fine-tuning, 2024. 1, 3
[49] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models, 2023. 1
[50] Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, Mengtian Li, Mingcong Liu, Yi Zhang, Shaojin Wu, Songtao Zhao, Jian Zhang, Qian He, and Xinglong Wu. Dreamo: A unified framework for image customization, 2025. 1, 3
[51] OpenAI. Dall·e 3. https://openai.com/index/dall-e-3/, 2023. Accessed: 2025-10-14. 1
[52] OpenAI. Gpt-4o image generation. https://openai.com/index/introducing-4o-image-generation/, 2025. Accessed: 2025-10-18. 1, 2, 10, 12, 13, 14
[53] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ... Dinov2: Learning robust visual features without supervision, 2024.
[54] Maitreya Patel, Tejas Gokhale, Chitta Baral, and Yezhou Yang. Conceptbed: Evaluating concept learning abilities of text-to-image diffusion models, 2024. 1
[55] Maitreya Patel, Sangmin Jung, Chitta Baral, and Yezhou Yang. λ-eclipse: Multi-concept personalized text-to-image diffusion models by leveraging clip latent space, 2024. 1, 3
[56] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 1
[57] Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, and Ran Xu. Unicontrol: A unified diffusion model for controllable visual generation in the wild, 2023. 1
[58] QwenLM. Qwen-image-edit-2509. https://huggingface.co/Qwen/Qwen-Image-Edit-2509, 2025. Accessed: 2025-11-12. 3, 8, 15
[59] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos,
[60] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. 1
[61] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023. 1
[62] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, and Kfir Aberman. Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models, 2024. 1
[63] Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wa... Seedream 4.0: Toward next-generation multimodal image generation, 2025.
[64] Jerry Sima, Eric Cheng, William Fedus, Miles Brundage, Mark Chen, Iason Gabriel, Sandhini Agarwal, Lilian Weng, et al. Addendum to gpt-4o system card: Native image capabilities. https://www.semanticscholar.org/paper/0c9b799e0dde7dcbe42f8dc61b242a0106739eba, Accessed: 2025-10-18. 1
[66] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, ... 2025.
[67] Wensong Song, Hong Jiang, Zongxing Yang, Ruijie Quan, and Yi Yang. Insert anything: Image insertion via in-context editing in dit, 2025. 1
[68] Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer, 2025. 3, 10
[69] OpenAI Team. Gpt-4o system card, 2024. 4, 6, 7, 13
[70] Darian Tomašević, Fadi Boutros, Chenhao Lin, Naser Damer, Vitomir Štruc, and Peter Peer. Id-booth: Identity-consistent face generation with diffusion models. In 2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG), pages 1–10. IEEE, 2025. 1
[71] Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, and Pheng-Ann Heng. Delving into rl for image generation with cot: A study on dpo vs. grpo. arXiv preprint arXiv:2505.17017, 2025. 12
[72] Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, and Pheng-Ann Heng. Delving into rl for image generation with cot: A study on dpo vs. grpo, 2025. 1
[73] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025. ...
[74] Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio Iacobacci, and Sarah Parisot. Mulan: A multi layer annotated dataset for controllable text-to-image generation, 2024. 3, 10
[75] Guanqun Wang, Jiaming Liu, Chenxuan Li, Yuan Zhang, Junpeng Ma, Xinyu Wei, Kevin Zhang, Maurice Chong, Renrui Zhang, Yijiang Liu, and Shanghang Zhang. Cloud-device collaborative learning for multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12646–12655, 2024. 12
[76] Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, and Shanghang Zhang. Mr-mllm: Mutual reinforcement of multimodal comprehension and vision perception. arXiv preprint arXiv:2406.15768, 2024. 12
[77] Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie, Xiaokun Wang, Yichen Wei, Chuanxin Tang, Bo Zhu, Changshi Li, Hongyang Wei, Eric Li, Xuchen Song, Yang Liu, and Yahui Zhou. Skywork unipic: Unified autoregressive modeling for visual understanding and generation, 2025. 1
[78] Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, and Jianchao Yang. Seededit 3.0: Fast and high-quality generative image editing,
[79] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds, 2024. 1
[80] Xierui Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance, 2025. 1, 3

[1] Ruichuan An, Sihan Yang, Ming Lu, Renrui Zhang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, et al. Mc-llava: Multi-concept personalized vision-language model. arXiv preprint arXiv:2411.11706, 2024. 12
[2] Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, et al. Unictokens: Boosting personalized understanding and generation via unified concept tokens. arXiv preprint arXiv:2505.14671, 2025. 12
[3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025. ...
[4] BKM1804. headshot istockphoto: Headshot image dataset from istockphoto. https://huggingface.co/datasets/BKM1804/headshot_istockphoto, Accessed: 2025-11-05, 6 000 images. 3, 10
[6] BKM1804. headshot pexels v1: High-resolution headshot dataset from pexels. https://huggingface.co/datasets/BKM1804/headshot_pexels_v1, 2025. Accessed: 2025-11-05, includes 3 000 images. 3, 10
[7] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions,
[8] Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, Tiankai Hang, Duojun Huang, Jie Jiang, Zhengkai Jiang, Weijie Kong, Changlin Li, Donghao Li, Junzhe Li, Xin Li, Yang Li, Zhenxi Li, Zhimin Li, Jiaxin Lin, Linus, Lucaz Liu, Shu Liu, Songtao Liu, Yu Liu, Yuhong Liu, Yanxin Long, Fan... Hunyuanimage 3.0 technical report,
[9] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021. 6, 10
[10] Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, and Xinglong Wu. Xverse: Consistent multi-subject control of identity and semantic attributes via dit modulation, 2025. 1, 3
[11] Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation, 2024. 1
[12] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,
[13] Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, and Ran Xu. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset, 2025. 2, 8, 11, 13
[14] Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, and Ran Xu. Blip3o-next: Next frontier of native image generation, 2025. 8
[15] Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, and Weiming Zhang. Lamic: Layout-aware multi-image composition via scalability of multimodal diffusion transformer, 2025. 1, 3
[16] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization, 2021. 3, 10
[17] Siying Cui, Jia Guo, Xiang An, Jiankang Deng, Yongle Zhao, Xinyu Wei, and Ziyong Feng. Idadapter: Learning mixed features for tuning-free personalization of text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 950–959, 2024. 1
[18] Siying Cui, Jia Guo, Xiang An, Jiankang Deng, Yongle Zhao, Xinyu Wei, and Ziyong Feng. Idadapter: Learning mixed features for tuning-free personalization of text-to-image models, 2024. 1
[19] Google DeepMind. Introducing gemini 1.5 flash: Fast, efficient, and multimodal. https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/, 2024. Accessed: 2025-10-18. 1, 2, 4, 5, 7, 10, 12, 13, 14
[20] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 2, 8, 11, 13
[21] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025. 1
[22] Jiankang Deng, Jia Guo, Jing Yang, Niannan Xue, Irene Kotsia, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):5962–5979,
[23] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. 1
[24] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Knowledge Discovery and Data Mining, 1996. 3
[25] Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Hu... Seedream 3.0 technical report, 2025.
[26] Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Linjie Yang, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Huang. Seedream 2.0: A native chinese-english bilingual image generation foundation model, 2025. ...
[27] Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Peng Zhang, and Qian He. Pulid: Pure and lightning id customization via contrastive alignment, 2024. 1, 3
[28] Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, and Pheng-Ann Heng. Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark. arXiv preprint arXiv:2510.26802, 2025. 14
[29] Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, and Pheng-Ann Heng. Thinking-while-generating: Interleaving textual reasoning throughout visual generation. arXiv preprint arXiv:2511.16671, 2025. 1
[30] [30]

Can we generate images with cot? let’s verify and reinforce image generation step by step,

Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Rui Huang, Haoquan Zhang, Manyuan Zhang, Jiaming Liu, Shanghang Zhang, Peng Gao, et al. Can we generate images with cot? let’s verify and reinforce image generation step by step.arXiv preprint arXiv:2501.13926, 2025. 1

work page arXiv 2025

[31] [31]

Ella: Equip diffusion models with llm for en- hanced semantic alignment, 2024

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for en- hanced semantic alignment, 2024. 14

work page 2024

[32] [32]

Resolving multi-condition confusion for finetuning-free personalized image generation, 2024

Qihan Huang, Siming Fu, Jinlong Liu, Hao Jiang, Yipeng Yu, and Jie Song. Resolving multi-condition confusion for finetuning-free personalized image generation, 2024. 1, 3

work page 2024

[33] [33]

T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025

Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hong- sheng Li. T2i-r1: Reinforcing image generation with col- laborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025. 1

work page arXiv 2025

[34] [34]

Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C. Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything, 2023. 1, 2, 3

work page 2023

[35] [35]

Viescore: Towards explainable metrics for conditional image synthesis evaluation, 2024

Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation, 2024. 7

work page 2024

[36] [36]

Harold W. Kuhn. The hungarian method for the assignment problem.Naval Research Logistics (NRL), 52, 1955. 6

work page 1955

[37] [37]

Flux.1 – official inference repository for flux models.https : / / github

Black Forest Labs. Flux.1 – official inference repository for flux models.https : / / github . com / black - forest-labs/flux, 2024. Accessed: 2025-10-18. 1

work page 2024

[38] [38]

Flux.1 kontext: Flow matching for in-context image generation and editing in latent space,

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dock- horn, Jack English, Zion English, Patrick Esser, Sumith Ku- lal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas M¨uller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context i...

work page

[39] [39]

Photomaker: Customizing re- alistic human photos via stacked id embedding, 2023

Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming- Ming Cheng, and Ying Shan. Photomaker: Customizing re- alistic human photos via stacked id embedding, 2023. 1

work page 2023

[40] [40]

Hunyuan-dit: A powerful multi-resolution diffusion trans- former with fine-grained chinese understanding, 2024

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Ji- hong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xu...

work page 2024

[41] [41]

Scale-aware modulation meet transformer

Weifeng Lin, Ziheng Wu, Jiayu Chen, Jun Huang, and Lian- wen Jin. Scale-aware modulation meet transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6015–6026, 2023. 12

work page 2023

[42] [42]

Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want.arXiv preprint arXiv:2403.20271,

Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want. arXiv preprint arXiv:2403.20271, 2024. 12

work page arXiv 2024

[43] [43]

Pixwizard: Versatile image-to-image visual assis- 15 tant with open-language instructions,

Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Huan Teng, Junlin Xie, Yu Qiao, Peng Gao, et al. Pixwizard: Versatile image-to-image visual assistant with open-language instructions.arXiv preprint arXiv:2409.15278, 2024. 1

work page arXiv 2024

[44] [44]

Perceive anything: Recog- nize, explain, caption, and segment anything in images and videos.arXiv preprint arXiv:2506.05302, 2025

Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recognize, explain, caption, and segment anything in images and videos.arXiv preprint arXiv:2506.05302, 2025. 12

work page arXiv 2025

[45] [45]

Grounding dino: Marry- ing dino with grounded pre-training for open-set object de- tection, 2024

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marry- ing dino with grounded pre-training for open-set object de- tection, 2024. 1, 2, 3

work page 2024

[46] [46]

Step1x-edit: A practical framework for gen- eral image editing, 2025

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chun- rui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, 22 Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, and Daxin Jiang. Step1x-edit: A practical framework for gen- eral image editing, 2025. 1

work page 2025

[47] [47]

Llm as dataset ana- lyst: Subpopulation structure discovery with large language model

Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Ji- aming Liu, and Shanghang Zhang. Llm as dataset ana- lyst: Subpopulation structure discovery with large language model. InEuropean Conference on Computer Vision, pages 235–252. Springer, 2024. 12

work page 2024

[48] [48]

Subject- diffusion:open domain personalized text-to-image genera- tion without test-time fine-tuning, 2024

Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject- diffusion:open domain personalized text-to-image genera- tion without test-time fine-tuning, 2024. 1, 3

work page 2024

[49] [49]

T2i- adapter: Learning adapters to dig out more controllable abil- ity for text-to-image diffusion models, 2023

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i- adapter: Learning adapters to dig out more controllable abil- ity for text-to-image diffusion models, 2023. 1

work page 2023

[50] [50]

Dreamo: A unified framework for image cus- tomization, 2025

Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, Mengtian Li, Mingcong Liu, Yi Zhang, Shaojin Wu, Songtao Zhao, Jian Zhang, Qian He, and Xin- glong Wu. Dreamo: A unified framework for image cus- tomization, 2025. 1, 3

work page 2025

[51] [51]

Dall·e 3.https://openai.com/index/ dall-e-3/, 2023

OpenAI. Dall·e 3.https://openai.com/index/ dall-e-3/, 2023. Accessed: 2025-10-14. 1

work page 2023

[52] [52]

Gpt-4o image generation.https : / / openai

OpenAI. Gpt-4o image generation.https : / / openai . com / index / introducing - 4o - image - generation/, 2025. Accessed: 2025-10-18. 1, 2, 10, 12, 13, 14

work page 2025

[53] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ... Dinov2: Learning robust visual features without supervision, 2024.

[54] Maitreya Patel, Tejas Gokhale, Chitta Baral, and Yezhou Yang. Conceptbed: Evaluating concept learning abilities of text-to-image diffusion models, 2024. 1

[55] Maitreya Patel, Sangmin Jung, Chitta Baral, and Yezhou Yang. λ-eclipse: Multi-concept personalized text-to-image diffusion models by leveraging clip latent space, 2024. 1, 3

[56] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 1

[57] Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, and Ran Xu. Unicontrol: A unified diffusion model for controllable visual generation in the wild, 2023. 1

[58] QwenLM. Qwen-image-edit-2509. https://huggingface.co/Qwen/Qwen-Image-Edit-2509, 2025. Accessed: 2025-11-12. 3, 8, 15

[59] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos,

[60] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. 1

[61] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023. 1

[62] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, and Kfir Aberman. Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models, 2024. 1

[63] Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wa... Seedream 4.0: Toward next-generation multimodal image generation, 2025.

[64] Jerry Sima, Eric Cheng, William Fedus, Miles Brundage, Mark Chen, Iason Gabriel, Sandhini Agarwal, Lilian Weng, et al. Addendum to gpt-4o system card: Native image capabilities. https://www.semanticscholar.org/paper/0c9b799e0dde7dcbe42f8dc61b242a0106739eba, Accessed: 2025-10-18. 1

[66] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,...

[67] Wensong Song, Hong Jiang, Zongxing Yang, Ruijie Quan, and Yi Yang. Insert anything: Image insertion via in-context editing in dit, 2025. 1

[68] Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer, 2025. 3, 10

[69] OpenAI Team. Gpt-4o system card, 2024. 4, 6, 7, 13

[70] Darian Tomašević, Fadi Boutros, Chenhao Lin, Naser Damer, Vitomir Štruc, and Peter Peer. Id-booth: Identity-consistent face generation with diffusion models. In 2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG), page 1–10. IEEE, 2025. 1

[71] Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, and Pheng-Ann Heng. Delving into rl for image generation with cot: A study on dpo vs. grpo. arXiv preprint arXiv:2505.17017, 2025. 12

[72] Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, and Pheng-Ann Heng. Delving into rl for image generation with cot: A study on dpo vs. grpo, 2025. 1

[73] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025.

[74] Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio Iacobacci, and Sarah Parisot. Mulan: A multi layer annotated dataset for controllable text-to-image generation, 2024. 3, 10

[75] Guanqun Wang, Jiaming Liu, Chenxuan Li, Yuan Zhang, Junpeng Ma, Xinyu Wei, Kevin Zhang, Maurice Chong, Renrui Zhang, Yijiang Liu, and Shanghang Zhang. Cloud-device collaborative learning for multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12646–12655, 2024. 12

[76] Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, and Shanghang Zhang. Mr-mllm: Mutual reinforcement of multimodal comprehension and vision perception. arXiv preprint arXiv:2406.15768, 2024. 12

[77] Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie, Xiaokun Wang, Yichen Wei, Chuanxin Tang, Bo Zhu, Changshi Li, Hongyang Wei, Eric Li, Xuchen Song, Yang Liu, and Yahui Zhou. Skywork unipic: Unified autoregressive modeling for visual understanding and generation, 2025. 1

[78] Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, and Jianchao Yang. Seededit 3.0: Fast and high-quality generative image editing,

[79] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds, 2024. 1

[80] Xierui Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance, 2025. 1, 3