MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition
Pith reviewed 2026-05-17 00:16 UTC · model grok-4.3
The pith
A 150K dataset of multi-image compositions trains AI models to produce coherent outputs from arbitrary numbers of reference images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that MICo-150K, built by categorizing multi-image composition into seven tasks, synthesizing balanced composites with proprietary models, and refining them via human-in-the-loop processes, plus an 11K real-image De&Re subset, supplies the missing training resource. Models fine-tuned on it acquire or improve multi-image composition, as shown by Qwen-MICo matching Qwen-Image-2509 on three-image tasks while supporting arbitrary inputs.
What carries the argument
MICo-150K dataset with its seven-task synthesis pipeline, human refinement, and De&Re real-image subset, paired with MICo-Bench and the Weighted-Ref-VIEScore metric.
If this is right
- Models without multi-image composition ability gain this capability after fine-tuning on MICo-150K.
- Models that already possess some composition skill show measurable further gains.
- The resulting baseline supports arbitrary numbers of input images rather than being restricted to a fixed count.
- Standardized evaluation across seven tasks and 300 challenging real-derived cases becomes available through MICo-Bench.
- The Weighted-Ref-VIEScore metric supplies a MICo-specific way to score identity preservation and overall quality.
Where Pith is reading between the lines
- The dataset may support downstream uses such as photo editing tools that combine user-provided references into new scenes.
- Extending the decomposition-recomposition method to video frames could help maintain consistency across time.
- Testing the trained models on entirely user-supplied real photos without further adaptation would reveal practical limits.
- Larger-scale versions of the same curation process might narrow remaining gaps to human-level composition performance.
Load-bearing premise
Images synthesized by proprietary models and refined by humans supply high-quality, identity-consistent training examples that generalize to real-world multi-image composition.
What would settle it
The generalization claim would be falsified if the fine-tuned models, tested on a fresh set of human-composed real photographs outside the synthetic distribution, showed a large drop in identity consistency or coherence relative to their MICo-Bench results.
Original abstract
In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data. To bridge this gap, we conduct a systematic study of MICo, categorizing it into 7 representative tasks and curate a large-scale collection of high-quality source images and construct diverse MICo prompts. Leveraging powerful proprietary models, we synthesize a rich amount of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency. We further build a Decomposition-and-Recomposition (De&Re) subset, where 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions. To enable comprehensive evaluation, we construct MICo-Bench with 100 cases per task and 300 challenging De&Re cases, and further introduce a new metric, Weighted-Ref-VIEScore, specifically tailored for MICo evaluation. Finally, we fine-tune multiple models on MICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K effectively equips models without MICo capability and further enhances those with existing skills. Notably, our baseline model, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs beyond the latter's limitation. Our dataset, benchmark, and baseline collectively offer valuable resources for further research on Multi-Image Composition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MICo-150K, a 150K-example dataset for multi-image composition (MICo) in controllable image generation. It categorizes MICo into 7 tasks, synthesizes balanced composites via proprietary models with human-in-the-loop filtering and refinement, and includes an 11K-example Decomposition-and-Recomposition (De&Re) subset derived from real-world images. The work also presents MICo-Bench (100 cases per task plus 300 De&Re cases) and the Weighted-Ref-VIEScore metric, then demonstrates via fine-tuning that models such as Qwen-MICo (from Qwen-Image-Edit) acquire or improve MICo skills, matching Qwen-Image-2509 on 3-image composition while supporting arbitrary numbers of inputs.
Significance. If the reported gains hold under rigorous scrutiny, the dataset and benchmark would address a clear data scarcity issue in identity-consistent multi-image synthesis, providing community resources that mix synthetic scale with real-image grounding via De&Re. The explicit support for arbitrary input counts beyond fixed baselines is a practical advance for controllable generation pipelines.
major comments (2)
- [§5 (Experiments)] The headline claim that Qwen-MICo matches Qwen-Image-2509 on 3-image composition while handling arbitrary inputs is load-bearing for the assertion that MICo-150K 'effectively equips models.' No numerical scores, standard deviations, or full baseline tables are referenced in the evaluation summary, preventing assessment of effect size or statistical reliability.
- [§3 (Dataset construction) and §5 (Evaluation)] Only 11K of the 150K examples come from real De&Re images; the remainder are proprietary-model synthetics after human filtering. No quantitative identity-consistency metrics (e.g., face-similarity scores or CLIP-based consistency) are reported on held-out real photographs, leaving the transfer assumption from synthetic to real multi-image scenarios untested and central to the generalization claim.
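The proportions behind these comments follow from the counts quoted in the abstract. The short sketch below works them out, assuming (as the referee does) that the 11K De&Re examples are counted within the 150K total and that "150K" and "11K" are round figures; the variable names are illustrative only.

```python
# Counts quoted in the abstract; the derived values are plain arithmetic.
NUM_TASKS = 7               # representative MICo tasks
CASES_PER_TASK = 100        # MICo-Bench cases per task
DERE_BENCH_CASES = 300      # challenging De&Re benchmark cases
DATASET_SIZE = 150_000      # "150K", treated as a round figure (assumption)
DERE_TRAIN_SIZE = 11_000    # "11K", treated as a round figure (assumption)

# Total benchmark size: per-task cases plus the De&Re cases.
bench_total = NUM_TASKS * CASES_PER_TASK + DERE_BENCH_CASES

# Share of training examples derived from real images via De&Re,
# assuming the 11K are included in the 150K.
real_share = DERE_TRAIN_SIZE / DATASET_SIZE
synthetic_size = DATASET_SIZE - DERE_TRAIN_SIZE

print(f"MICo-Bench cases: {bench_total}")
print(f"De&Re share of training data: {real_share:.1%}")
print(f"Remaining (synthetic) examples: {synthetic_size}")
```

Under these assumptions the benchmark holds 1,000 cases and roughly 7% of the training data is grounded in real images, which is the imbalance the second major comment turns on.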
minor comments (2)
- [§4 (Benchmark and Metric)] The formal definition and weighting scheme of the new Weighted-Ref-VIEScore metric should be stated explicitly with an equation or pseudocode to allow independent reproduction.
- [§3 (Dataset construction)] Filtering criteria and prompt templates used during human-in-the-loop refinement are described at high level; a supplementary table listing exact rejection rates or inter-annotator agreement would improve reproducibility.
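To make the first minor comment concrete, the sketch below shows the kind of pseudocode-level definition that would allow independent reproduction. It is not the paper's Weighted-Ref-VIEScore: the weighting scheme, the 0 to 10 score range, and the geometric-mean combination (modeled loosely on VIEScore) are all assumptions, and `weighted_ref_score` is a hypothetical name.

```python
import math

def weighted_ref_score(ref_scores, weights, quality):
    """Hypothetical aggregate; NOT the paper's Weighted-Ref-VIEScore.

    ref_scores: per-reference consistency scores (assumed range 0 to 10)
    weights:    non-negative importance weight for each reference (assumed)
    quality:    overall perceptual-quality score (assumed range 0 to 10)
    """
    if len(ref_scores) != len(weights) or not ref_scores:
        raise ValueError("need exactly one weight per reference score")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted mean of how well each reference is preserved.
    consistency = sum(s * w for s, w in zip(ref_scores, weights)) / total
    # Geometric mean, so a low value on either axis pulls the result down.
    return math.sqrt(consistency * quality)

# Three references, the first weighted twice as heavily as the others.
score = weighted_ref_score([8.0, 6.0, 9.0], [2.0, 1.0, 1.0], 7.0)
print(round(score, 3))
```

Whatever form the authors actually use, stating it at this level of detail (inputs, ranges, weights, and the combination rule) is what the comment asks for.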
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions have been made to strengthen the manuscript.
- Referee: [§5 (Experiments)] The headline claim that Qwen-MICo matches Qwen-Image-2509 on 3-image composition while handling arbitrary inputs is load-bearing for the assertion that MICo-150K 'effectively equips models.' No numerical scores, standard deviations, or full baseline tables are referenced in the evaluation summary, preventing assessment of effect size or statistical reliability.
  Authors: We agree that the evaluation summary in §5 should explicitly reference quantitative results to support the headline claim. The detailed comparisons, including scores on 3-image composition and arbitrary input counts, along with standard deviations, appear in the full experimental tables. We have revised §5 to include key numerical values from those tables and to direct readers to the complete baseline results for proper assessment of effect sizes.
  Revision: yes
- Referee: [§3 (Dataset construction) and §5 (Evaluation)] Only 11K of the 150K examples come from real De&Re images; the remainder are proprietary-model synthetics after human filtering. No quantitative identity-consistency metrics (e.g., face-similarity scores or CLIP-based consistency) are reported on held-out real photographs, leaving the transfer assumption from synthetic to real multi-image scenarios untested and central to the generalization claim.
  Authors: The referee is correct that the majority of examples are synthetic and that explicit quantitative identity-consistency metrics on held-out real photographs were not reported. The De&Re subset provides real-image grounding, and human filtering was applied for quality control. To address the transfer assumption, we have added new quantitative evaluations (face-similarity and CLIP consistency) on held-out real images to the revised §5.
  Revision: yes
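An embedding-based identity-consistency check of the kind discussed in this exchange can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes each reference subject and each subject detected in the generated image has already been mapped to a feature vector by some encoder (a face-recognition or CLIP-style model, for example), and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-length embedding")
    return dot / (norm_u * norm_v)

def identity_consistency(ref_embeddings, out_embeddings):
    """Mean, over references, of the best cosine match in the output.

    ref_embeddings: one vector per reference subject (assumed precomputed)
    out_embeddings: one vector per subject found in the generated image
    """
    if not ref_embeddings or not out_embeddings:
        raise ValueError("need at least one reference and one output subject")
    best = [max(cosine(r, o) for o in out_embeddings) for r in ref_embeddings]
    return sum(best) / len(best)

# Toy vectors: each reference has an exact directional match in the output.
refs = [[1.0, 0.0], [0.0, 1.0]]
outs = [[0.0, 2.0], [3.0, 0.0]]
consistency = identity_consistency(refs, outs)
print(consistency)
```

Best-match scoring lets two references claim the same output subject; a one-to-one assignment between references and detected subjects would be a stricter variant.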
Circularity Check
No significant circularity in derivation or claims
The paper's core contribution is a data-curation pipeline that synthesizes composites via proprietary models, applies human-in-the-loop filtering, and constructs a separate benchmark (MICo-Bench) with real De&Re cases. Reported results consist of empirical fine-tuning outcomes on this dataset evaluated against the benchmark and external models (Qwen-Image-Edit, Qwen-Image-2509). No equations, predictions, or central claims reduce by construction to fitted parameters from the same data, no self-citations bear load-bearing weight on uniqueness or ansatzes, and no renaming of known results occurs. The chain remains a standard empirical dataset-plus-benchmark workflow that is self-contained against external model comparisons.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Proprietary models can generate high-quality, identity-consistent composite images that become suitable training data after human-in-the-loop filtering and refinement.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Paper passage: Leveraging powerful proprietary models, we synthesize a rich amount of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  Paper passage: We further build a Decomposition-and-Recomposition (De&Re) subset, where 11K real-world complex images are decomposed into components and recomposed
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
- UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
  UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.
- Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro
  Banana100 dataset shows that none of 21 popular NR-IQA metrics consistently rate images degraded by 100 iterative edits lower than clean originals.
- UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
  A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
- LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
  LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
Reference graph
Works this paper leans on
-
[1]
Mc-llava: Multi-concept personalized vision-language model.arXiv preprint arXiv:2411.11706, 2024
Ruichuan An, Sihan Yang, Ming Lu, Renrui Zhang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, et al. Mc-llava: Multi-concept personalized vision-language model.arXiv preprint arXiv:2411.11706, 2024. 12
-
[2]
Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, et al. Unictokens: Boosting personalized understand- ing and generation via unified concept tokens.arXiv preprint arXiv:2505.14671, 2025. 12
-
[3]
Qwen2.5-vl technical report, 2025
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report,...
work page 2025
-
[4]
headshot istockphoto: Headshot image dataset from istockphoto.https : / / huggingface
BKM1804. headshot istockphoto: Headshot image dataset from istockphoto.https : / / huggingface . co / datasets / BKM1804 / headshot _ istockphoto,
-
[5]
Accessed: 2025-11-05, 6 000 images. 3, 10
work page 2025
-
[6]
headshot pexels v1: High-resolution headshot dataset from pexels.https : / / huggingface
BKM1804. headshot pexels v1: High-resolution headshot dataset from pexels.https : / / huggingface . co / datasets/BKM1804/headshot_pexels_v1, 2025. Accessed: 2025-11-05, includes 3 000 images. 3, 10
work page 2025
-
[7]
Tim Brooks, Aleksander Holynski, and Alexei A. Efros. In- structpix2pix: Learning to follow image editing instructions,
-
[8]
Hunyuanimage 3.0 technical report,
Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yu- tao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, Tiankai Hang, Duojun Huang, Jie Jiang, Zhengkai Jiang, Weijie Kong, Changlin Li, Donghao Li, Jun- zhe Li, Xin Li, Yang Li, Zhenxi Li, Zhimin Li, Jiaxin Lin, Li- nus, Lucaz Liu, Shu Liu, Songtao Liu, Yu Liu, Yuhong Liu, Yanxin Long, Fan...
-
[9]
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre- training to recognize long-tail visual concepts, 2021. 6, 10
work page 2021
-
[10]
Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, and Xinglong Wu. Xverse: Consistent multi-subject control of identity and semantic attributes via dit modulation, 2025. 1, 3
work page 2025
-
[11]
Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Identity- preserving disentangled tuning for subject-driven text-to- image generation, 2024. 1
work page 2024
-
[12]
Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis,
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis,
-
[13]
Blip3- o: A family of fully open unified multimodal models- architecture, training and dataset, 2025
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Sil- vio Savarese, Le Xue, Caiming Xiong, and Ran Xu. Blip3- o: A family of fully open unified multimodal models- architecture, training and dataset, 2025. 2, 8, 11, 13
work page 2025
-
[14]
Blip3o-next: Next frontier of native image generation, 2025
Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, and Ran Xu. Blip3o-next: Next frontier of native image generation, 2025. 8
work page 2025
-
[15]
Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, and Weiming Zhang. Lamic: Layout-aware multi- image composition via scalability of multimodal diffusion transformer, 2025. 1, 3
work page 2025
-
[16]
Viton-hd: High-resolution virtual try-on via misalignment-aware normalization, 2021
Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization, 2021. 3, 10
work page 2021
-
[17]
Idadapter: Learn- ing mixed features for tuning-free personalization of text-to- image models
Siying Cui, Jia Guo, Xiang An, Jiankang Deng, Yongle Zhao, Xinyu Wei, and Ziyong Feng. Idadapter: Learn- ing mixed features for tuning-free personalization of text-to- image models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Work- shops, pages 950–959, 2024. 1
work page 2024
-
[18]
Idadapter: Learn- ing mixed features for tuning-free personalization of text-to- image models, 2024
Siying Cui, Jia Guo, Xiang An, Jiankang Deng, Yongle Zhao, Xinyu Wei, and Ziyong Feng. Idadapter: Learn- ing mixed features for tuning-free personalization of text-to- image models, 2024. 1
work page 2024
-
[19]
Introducing gemini 1.5 flash: Fast, efficient, and multimodal.https : / / developers
Google DeepMind. Introducing gemini 1.5 flash: Fast, efficient, and multimodal.https : / / developers . googleblog.com/en/introducing- gemini- 2- 5-flash-image/, 2024. Accessed: 2025-10-18. 1, 2, 4, 5, 7, 10, 12, 13, 14
work page 2024
-
[20]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 2, 8, 11, 13
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
Emerging properties in unified multimodal pretraining, 2025
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025. 1
work page 2025
-
[22]
Jiankang Deng, Jia Guo, Jing Yang, Niannan Xue, Irene Kot- sia, and Stefanos Zafeiriou. Arcface: Additive angular mar- gin loss for deep face recognition.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 44(10):5962–5979,
-
[23]
Scaling rectified flow trans- formers for high-resolution image synthesis, 2024
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim 21 Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yan- nik Marek, and Robin Rombach. Scaling rectified flow trans- formers for high-resolution image synthesis, 2024. 1
work page 2024
-
[24]
A density-based algorithm for discovering clusters in large spatial databases with noise
Martin Ester, Hans-Peter Kriegel, J ¨org Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. InKnowledge Discovery and Data Mining, 1996. 3
work page 1996
-
[25]
Seedream 3.0 technical report, 2025
Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Hu...
work page 2025
-
[26]
See- dream 2.0: A native chinese-english bilingual image genera- tion foundation model, 2025
Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Lin- jie Yang, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, and Weilin Huang. See- dream 2.0: A native chines...
work page 2025
-
[27]
Pulid: Pure and lightning id customiza- tion via contrastive alignment, 2024
Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Peng Zhang, and Qian He. Pulid: Pure and lightning id customiza- tion via contrastive alignment, 2024. 1, 3
work page 2024
-
[28]
Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, and Pheng-Ann Heng. Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark. arXiv preprint arXiv:2510.26802, 2025. 14
-
[29]
Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, and Pheng- Ann Heng. Thinking-while-generating: Interleaving tex- tual reasoning throughout visual generation.arXiv preprint arXiv:2511.16671, 2025. 1
-
[30]
Can we generate images with cot? let’s verify and reinforce image generation step by step,
Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Rui Huang, Haoquan Zhang, Manyuan Zhang, Jiaming Liu, Shanghang Zhang, Peng Gao, et al. Can we generate images with cot? let’s verify and reinforce image generation step by step.arXiv preprint arXiv:2501.13926, 2025. 1
-
[31]
Ella: Equip diffusion models with llm for en- hanced semantic alignment, 2024
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for en- hanced semantic alignment, 2024. 14
work page 2024
-
[32]
Resolving multi-condition confusion for finetuning-free personalized image generation, 2024
Qihan Huang, Siming Fu, Jinlong Liu, Hao Jiang, Yipeng Yu, and Jie Song. Resolving multi-condition confusion for finetuning-free personalized image generation, 2024. 1, 3
work page 2024
-
[33]
Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hong- sheng Li. T2i-r1: Reinforcing image generation with col- laborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025. 1
-
[34]
Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C. Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything, 2023. 1, 2, 3
work page 2023
-
[35]
Viescore: Towards explainable metrics for conditional image synthesis evaluation, 2024
Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation, 2024. 7
work page 2024
-
[36]
Harold W. Kuhn. The hungarian method for the assignment problem.Naval Research Logistics (NRL), 52, 1955. 6
work page 1955
-
[37]
Flux.1 – official inference repository for flux models.https : / / github
Black Forest Labs. Flux.1 – official inference repository for flux models.https : / / github . com / black - forest-labs/flux, 2024. Accessed: 2025-10-18. 1
work page 2024
-
[38]
Flux.1 kontext: Flow matching for in-context image generation and editing in latent space,
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dock- horn, Jack English, Zion English, Patrick Esser, Sumith Ku- lal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas M¨uller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context i...
-
[39]
Photomaker: Customizing re- alistic human photos via stacked id embedding, 2023
Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming- Ming Cheng, and Ying Shan. Photomaker: Customizing re- alistic human photos via stacked id embedding, 2023. 1
work page 2023
-
[40]
Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Ji- hong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xu...
work page 2024
-
[41]
Scale-aware modulation meet transformer
Weifeng Lin, Ziheng Wu, Jiayu Chen, Jun Huang, and Lian- wen Jin. Scale-aware modulation meet transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6015–6026, 2023. 12
work page 2023
-
[42]
Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want. arXiv preprint arXiv:2403.20271, 2024. 12
-
[43]
Pixwizard: Versatile image-to-image visual assis- 15 tant with open-language instructions,
Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Huan Teng, Junlin Xie, Yu Qiao, Peng Gao, et al. Pixwizard: Versatile image-to-image visual assistant with open-language instructions.arXiv preprint arXiv:2409.15278, 2024. 1
-
[44]
Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recognize, explain, caption, and segment anything in images and videos.arXiv preprint arXiv:2506.05302, 2025. 12
-
[45]
Grounding dino: Marry- ing dino with grounded pre-training for open-set object de- tection, 2024
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marry- ing dino with grounded pre-training for open-set object de- tection, 2024. 1, 2, 3
work page 2024
-
[46]
Step1x-edit: A practical framework for gen- eral image editing, 2025
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chun- rui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, 22 Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, and Daxin Jiang. Step1x-edit: A practical framework for gen- eral image editing, 2025. 1
work page 2025
-
[47]
Llm as dataset ana- lyst: Subpopulation structure discovery with large language model
Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Ji- aming Liu, and Shanghang Zhang. Llm as dataset ana- lyst: Subpopulation structure discovery with large language model. InEuropean Conference on Computer Vision, pages 235–252. Springer, 2024. 12
work page 2024
-
[48]
Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject- diffusion:open domain personalized text-to-image genera- tion without test-time fine-tuning, 2024. 1, 3
work page 2024
-
[49]
Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i- adapter: Learning adapters to dig out more controllable abil- ity for text-to-image diffusion models, 2023. 1
work page 2023
-
[50]
Dreamo: A unified framework for image cus- tomization, 2025
Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, Mengtian Li, Mingcong Liu, Yi Zhang, Shaojin Wu, Songtao Zhao, Jian Zhang, Qian He, and Xin- glong Wu. Dreamo: A unified framework for image cus- tomization, 2025. 1, 3
work page 2025
-
[51]
Dall·e 3.https://openai.com/index/ dall-e-3/, 2023
OpenAI. Dall·e 3.https://openai.com/index/ dall-e-3/, 2023. Accessed: 2025-10-14. 1
work page 2023
-
[52]
Gpt-4o image generation.https : / / openai
OpenAI. Gpt-4o image generation.https : / / openai . com / index / introducing - 4o - image - generation/, 2025. Accessed: 2025-10-18. 1, 2, 10, 12, 13, 14
work page 2025
-
[53]
Dinov2: Learning robust visual features with- out supervision, 2024
Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e Je- gou, Julien Mairal, ...
work page 2024
-
[54]
Conceptbed: Evaluating concept learning abilities of text-to-image diffusion models, 2024
Maitreya Patel, Tejas Gokhale, Chitta Baral, and Yezhou Yang. Conceptbed: Evaluating concept learning abilities of text-to-image diffusion models, 2024. 1
work page 2024
-
[55]
Maitreya Patel, Sangmin Jung, Chitta Baral, and Yezhou Yang.λ-eclipse: Multi-concept personalized text-to-image diffusion models by leveraging clip latent space, 2024. 1, 3
work page 2024
-
[56]
Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 1
work page 2023
-
[57]
Unicontrol: A unified diffusion model for controllable visual generation in the wild, 2023
Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, and Ran Xu. Unicontrol: A unified diffusion model for controllable visual generation in the wild, 2023. 1
work page 2023
-
[58]
Qwen-image-edit-2509.https : / / huggingface
QwenLM. Qwen-image-edit-2509.https : / / huggingface . co / Qwen / Qwen - Image - Edit - 2509, 2025. Accessed: 2025-11-12. 3, 8, 15
work page 2025
-
[59]
Sam 2: Segment anything in images and videos,
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos,
-
[60]
High-resolution image syn- thesis with latent diffusion models, 2022
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models, 2022. 1
work page 2022
-
[61]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023. 1
work page 2023
-
[62]
Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models, 2024
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, and Kfir Aberman. Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models, 2024. 1
work page 2024
-
[63]
Seedream 4.0: Toward next-generation multimodal image generation, 2025
Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tong- tong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wa...
work page 2025
-
[64]
Addendum to gpt-4o system card: Native image capabilities
Jerry Sima, Eric Cheng, William Fedus, Miles Brundage, Mark Chen, Iason Gabriel, Sandhini Agarwal, Lilian Weng, et al. Addendum to gpt-4o system card: Native image capabilities. https : / / www . semanticscholar . org / paper / 0c9b799e0dde7dcbe42f8dc61b242a0106739eba,
-
[65]
Accessed: 2025-10-18. 1
work page 2025
-
[66]
Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,...
-
[67]
Wensong Song, Hong Jiang, Zongxing Yang, Ruijie Quan, and Yi Yang. Insert anything: Image insertion via in-context editing in dit, 2025. 1
-
[68]
Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer, 2025. 3, 10
- [69]
-
[70]
Darian Tomašević, Fadi Boutros, Chenhao Lin, Naser Damer, Vitomir Štruc, and Peter Peer. Id-booth: Identity-consistent face generation with diffusion models. In 2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG), pages 1–10. IEEE, 2025. 1
-
[71]
Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, and Pheng-Ann Heng. Delving into rl for image generation with cot: A study on dpo vs. grpo. arXiv preprint arXiv:2505.17017, 2025. 12
-
[72]
Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, and Pheng-Ann Heng. Delving into rl for image generation with cot: A study on dpo vs. grpo, 2025. 1
-
[73]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense feature...
-
[74]
Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio Iacobacci, and Sarah Parisot. Mulan: A multi layer annotated dataset for controllable text-to-image generation, 2024. 3, 10
-
[75]
Guanqun Wang, Jiaming Liu, Chenxuan Li, Yuan Zhang, Junpeng Ma, Xinyu Wei, Kevin Zhang, Maurice Chong, Renrui Zhang, Yijiang Liu, and Shanghang Zhang. Cloud-device collaborative learning for multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12646–12655, 2024. 12
-
[76]
Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, and Shanghang Zhang. Mr-mllm: Mutual reinforcement of multimodal comprehension and vision perception. arXiv preprint arXiv:2406.15768, 2024. 12
-
[77]
Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie, Xiaokun Wang, Yichen Wei, Chuanxin Tang, Bo Zhu, Changshi Li, Hongyang Wei, Eric Li, Xuchen Song, Yang Liu, and Yahui Zhou. Skywork unipic: Unified autoregressive modeling for visual understanding and generation, 2025. 1
-
[78]
Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, and Jianchao Yang. Seededit 3.0: Fast and high-quality generative image editing,
-
[79]
Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds, 2024. 1
-
[80]
Xierui Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance, 2025. 1, 3