UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
Pith reviewed 2026-05-14 21:57 UTC · model grok-4.3
The pith
Fusing ViT and VAE features early before VLM encoding unifies conditioning for multi-reference image generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UniCustom introduces a unified visual conditioning framework that fuses ViT semantic features and VAE appearance features prior to VLM encoding, allowing hidden states to jointly encode the referred subject and its visual appearance. This is achieved with a lightweight linear fusion layer, followed by reconstruction-oriented pretraining that preserves reference-specific details, supervised finetuning on single- and multi-reference tasks, and slot-wise binding regularization to reduce cross-reference entanglement.
What carries the argument
Early fusion of ViT and VAE features before VLM encoding, using a lightweight linear layer.
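As a concrete sketch of this claim, the fusion step amounts to concatenating per-token ViT semantic features with VAE appearance features and applying one linear projection before the tokens enter the VLM. The dimensions and initialization below are illustrative assumptions — the reviewed text does not specify them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only; the paper does not
# state them in the text reviewed here.
d_vit, d_vae, d_vlm = 768, 16, 1024
n_tokens = 64  # visual tokens per reference image

def fuse_early(vit_feats, vae_feats, W, b):
    """Concatenate semantic (ViT) and appearance (VAE) features per token,
    then project with a single linear layer before VLM encoding."""
    fused = np.concatenate([vit_feats, vae_feats], axis=-1)  # (n, d_vit + d_vae)
    return fused @ W + b                                     # (n, d_vlm)

W = rng.standard_normal((d_vit + d_vae, d_vlm)) * 0.02
b = np.zeros(d_vlm)
vit = rng.standard_normal((n_tokens, d_vit))
vae = rng.standard_normal((n_tokens, d_vae))

tokens = fuse_early(vit, vae, W, b)
print(tokens.shape)  # (64, 1024)
```

The point of the sketch is that the only new trainable machinery is one weight matrix; everything else is the unchanged VLM and diffusion backbone.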
If this is right
- Subject consistency rises across multiple references in the generated images.
- Textual instructions are followed more accurately in scenes with several subjects.
- Compositional fidelity improves and attribute leakage decreases compared with decoupled baselines.
- Cross-reference confusion drops when prompts involve more than one reference image.
Where Pith is reading between the lines
- The separation of semantic and appearance streams appears to be the main source of binding failures in current multi-reference systems.
- The same early-fusion pattern could be tested on single-reference or video generation tasks to check whether it generalizes.
- Slot-wise regularization may prove useful in other multimodal models that must keep multiple visual inputs distinct.
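A minimal sketch of what a slot-wise binding regularizer could look like, assuming (hypothetically) a per-slot decoder and per-reference latents; the paper's exact formulation is not given in the text reviewed here:

```python
import numpy as np

rng = np.random.default_rng(1)
n_slots, n_tokens, d = 3, 8, 16  # toy sizes, assumptions for illustration

def slot_binding_loss(slot_hidden, ref_latents, decode):
    """For each image slot i, decode its hidden states and penalize
    reconstruction error against reference i only, so that low-level
    details stay bound to the correct reference."""
    per_slot = [np.mean((decode(h) - z) ** 2)
                for h, z in zip(slot_hidden, ref_latents)]
    return float(np.mean(per_slot))

# Toy check with an identity "decoder": perfectly bound slots give zero
# loss, while swapping slots across references does not.
refs = [rng.standard_normal((n_tokens, d)) for _ in range(n_slots)]
bound = slot_binding_loss(refs, refs, decode=lambda h: h)
shuffled = slot_binding_loss(refs, refs[::-1], decode=lambda h: h)
print(bound, shuffled)  # bound == 0.0, shuffled > 0
```

The contrast between the bound and shuffled cases is exactly the entanglement the regularizer is meant to penalize.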
Load-bearing premise
That fusing ViT and VAE features early will let the VLM hidden states jointly encode subject semantics and appearance details without losing information or adding complexity that hurts performance.
What would settle it
If controlled ablations on the same benchmarks show that removing early fusion causes no measurable loss in subject consistency scores and no increase in attribute leakage, the central claim would be falsified.
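The settling test can be made concrete as a paired ablation over the same prompts: the mean per-prompt gain from keeping early fusion, with its standard error, is the quantity that would need to vanish for the claim to fail. The scores below are invented for illustration only:

```python
import numpy as np

def ablation_gain(full, ablated):
    """Paired per-prompt gain in subject-consistency score from keeping
    early fusion; a mean near zero (within noise) would falsify the claim."""
    diff = np.asarray(full) - np.asarray(ablated)
    return float(diff.mean()), float(diff.std(ddof=1) / np.sqrt(len(diff)))

# Hypothetical consistency scores on the same five prompts.
full_model = [0.81, 0.78, 0.84, 0.79, 0.82]
no_early_fusion = [0.74, 0.72, 0.79, 0.73, 0.75]

mean_gain, stderr = ablation_gain(full_model, no_early_fusion)
print(round(mean_gain, 3))  # 0.062
```

A paired comparison on identical prompts is preferable to comparing aggregate benchmark numbers, since it removes per-prompt difficulty as a confound.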
Original abstract
Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image. As a result, the model may recognize which subject is being referred to, but fail to preserve its identity and fine-grained appearance, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual conditioning framework that fuses ViT and VAE features before VLM encoding. This early fusion exposes the VLM to both semantic cues and appearance-rich details, enabling its hidden states to jointly encode the referred subject and corresponding visual appearance with only a lightweight linear fusion layer. To learn such unified representations, we adopt a two-stage training strategy: reconstruction-oriented pretraining that preserves reference-specific appearance details in the fused hidden states, followed by supervised finetuning on single- and multi-reference generation tasks. We further introduce a slot-wise binding regularization that encourages each image slot to preserve low-level details of its corresponding reference, thereby reducing cross-reference entanglement. Experiments on two multi-reference generation benchmarks demonstrate that UniCustom consistently improves subject consistency, instruction following, and compositional fidelity over strong baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces UniCustom, a unified visual conditioning framework for multi-reference image generation. It fuses ViT semantic features and VAE appearance features early via a lightweight linear layer before VLM encoding, employs two-stage training (reconstruction pretraining followed by supervised finetuning on single- and multi-reference tasks), and adds slot-wise binding regularization to reduce cross-reference entanglement. The central claim is that this addresses attribute leakage and confusion in decoupled conditioning setups, yielding consistent gains in subject consistency, instruction following, and compositional fidelity on two benchmarks over strong baselines.
Significance. If the empirical gains hold under rigorous controls, the early-fusion design offers a practical alternative to decoupled ViT/VAE conditioning in VLM-augmented diffusion models, potentially reducing identity leakage in complex multi-subject scenes. The two-stage training and slot-wise regularization are straightforward to implement and could transfer to related tasks such as personalized generation or compositional editing.
Major comments (2)
- [Abstract, §4] The claim of 'consistent improvements' over baselines is stated without quantitative metrics, ablation tables, or statistical significance tests in the provided text; this makes it impossible to assess effect sizes or baseline fairness, yet it is load-bearing for the performance contribution.
- [§3.2] The early-fusion hypothesis (ViT+VAE features jointly encoded in VLM hidden states without information loss) is motivated but lacks an explicit derivation or information-theoretic argument for why linear fusion preserves low-level details better than late injection; an ablation isolating fusion timing is needed to support the architectural choice.
Minor comments (2)
- [§3.1] Notation for the number of reference slots and the exact dimensionality of the fused feature vector should be defined explicitly in §3.1 to avoid ambiguity when scaling to variable numbers of references.
- [§3.3] The description of the reconstruction pretraining objective would benefit from the precise loss formulation (e.g., whether it is pixel-space MSE, perceptual loss, or a combination) to allow reproduction.
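To make the referee's point concrete, one plausible combined objective (pixel-space MSE plus a perceptual term) can be sketched as follows; the weights and the stand-in feature map are assumptions for illustration, not the paper's actual loss:

```python
import numpy as np

def recon_loss(pred, target, feat, w_pix=1.0, w_perc=0.1):
    """Combined reconstruction objective: pixel-space MSE plus a
    perceptual term computed in a feature space `feat`. Weights and the
    choice of `feat` are illustrative assumptions, not the paper's."""
    pix = np.mean((pred - target) ** 2)
    perc = np.mean((feat(pred) - feat(target)) ** 2)
    return w_pix * pix + w_perc * perc

rng = np.random.default_rng(2)
x = rng.random((32, 32, 3))
noisy = x + 0.05 * rng.standard_normal(x.shape)

# Stand-in "perceptual" feature map: a crude 2x2 average pool.
pool = lambda im: im.reshape(16, 2, 16, 2, 3).mean(axis=(1, 3))

loss = recon_loss(noisy, x, pool)
print(loss > 0.0 and recon_loss(x, x, pool) == 0.0)  # True
```

Stating which of these terms the pretraining objective actually uses, with weights, is what the comment asks for.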
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results and architectural motivations.
Point-by-point responses
Referee: [Abstract, §4] The claim of 'consistent improvements' over baselines is stated without quantitative metrics, ablation tables, or statistical significance tests in the provided text; this makes it impossible to assess effect sizes or baseline fairness, yet it is load-bearing for the performance contribution.
Authors: We agree that the abstract and §4 would benefit from explicit quantitative support. The full manuscript contains tables reporting subject consistency, instruction following, and compositional fidelity metrics on both benchmarks, along with ablations, but these were not summarized numerically in the abstract. In the revised version we will add specific delta values, standard deviations, and significance indicators to the abstract and expand §4 with a consolidated results table plus statistical tests. Revision: yes.
Referee: [§3.2] The early-fusion hypothesis (ViT+VAE features jointly encoded in VLM hidden states without information loss) is motivated but lacks an explicit derivation or information-theoretic argument for why linear fusion preserves low-level details better than late injection; an ablation isolating fusion timing is needed to support the architectural choice.
Authors: The motivation in §3.2 is primarily empirical, showing that early fusion enables the VLM to bind appearance details to semantic slots before diffusion. We acknowledge the absence of a formal information-theoretic derivation or a dedicated timing ablation. In revision we will add a short paragraph deriving the benefit from reduced cross-attention entropy and include an ablation comparing early vs. late fusion on a controlled single-reference subset, reporting both quantitative metrics and qualitative leakage examples. Revision: yes.
Circularity Check
No significant circularity detected
Full rationale
The paper describes an architectural proposal (early ViT+VAE fusion before VLM encoding, two-stage training, slot-wise regularization) and asserts empirical gains on two benchmarks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on experimental comparison rather than any self-referential reduction of outputs to inputs by construction.