arxiv: 2512.12675 · v2 · submitted 2025-12-14 · 💻 cs.CV · cs.AI

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang

Pith reviewed 2026-05-16 22:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords subject-driven image generation · multi-subject composition · subject distinction · unified understanding-generation · semantic bridge · two-stage training · attention-based masking · SconeEval benchmark

The pith

Scone uses an understanding expert as a semantic bridge to let generation models handle both multi-subject composition and correct distinction without interference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Subject-driven image generation has improved at combining several subjects but still fails when prompts contain multiple similar candidates and the model must pick the right one. Scone introduces a single architecture that trains an understanding module and a generation module together so the understanding module can pass semantic signals to the generator. The method first teaches composition, then adds distinction through semantic alignment and attention masking that prevents cross-subject leakage. Experiments on two benchmarks show the resulting model outperforms prior open-source systems at both tasks while preserving subject identity.

Core claim

Scone is a unified understanding-generation model for subject-driven image generation. The understanding expert functions as a semantic bridge that conveys information to the generation expert, enabling it to preserve subject identity while minimizing interference among multiple subjects. Training proceeds in two stages: the first stage learns composition, and the second stage strengthens distinction via semantic alignment and attention-based masking. The approach is evaluated on the new SconeEval benchmark and on existing benchmarks, where it surpasses prior open-source models in both composition and distinction metrics.

What carries the argument

The understanding expert acting as a semantic bridge inside a unified understanding-generation architecture, trained with a two-stage schedule of composition learning followed by semantic alignment and attention-based masking.
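One way to picture attention-based masking is as a similarity filter: reference-image tokens whose hidden states align poorly with the instruction text are hidden from the generator. The sketch below is an illustrative assumption, not the authors' implementation. The function names (`cosine`, `relevance_mask`), the max-over-text-tokens scoring rule, and the threshold of 0.5 are all hypothetical, and the 2-D vectors are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def relevance_mask(image_tokens, text_tokens, threshold=0.5):
    """Return one boolean per image token: True = keep, False = mask out.

    Hypothetical rule: a token is kept when its best cosine similarity
    to any text token reaches the threshold.
    """
    mask = []
    for img in image_tokens:
        score = max(cosine(img, txt) for txt in text_tokens)
        mask.append(score >= threshold)
    return mask

# Toy hidden states: one text token, three image tokens.
text_tokens = [[1.0, 0.0]]
image_tokens = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
mask = relevance_mask(image_tokens, text_tokens)
print(mask)
```

In this toy setup only the first image token points in roughly the same direction as the text token, so it is the only one kept. A real model would apply such a mask inside attention rather than on a Python list.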

If this is right

Multi-subject prompts become reliable for realistic scene generation without manual subject isolation.
Subject identity is preserved across varying contexts while avoiding leakage from other reference images.
The SconeEval benchmark provides a standardized way to measure both composition accuracy and distinction correctness.
Open-source models can now be fine-tuned with the same two-stage recipe to close the gap with closed models on complex subject tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The semantic-bridge pattern could transfer to video or 3D generation where temporal or spatial consistency across subjects is required.
Attention masking during the distinction stage may generalize to other conditional generation settings that need selective focus on reference signals.
Combining the method with larger pretrained understanding models could further reduce the data needed for the distinction stage.

Load-bearing premise

The two-stage training with semantic alignment and attention-based masking can strengthen distinction without lowering composition quality or creating new interference between subjects.

What would settle it

A controlled test set of prompts containing two visually similar subjects where the model after distinction training either swaps identities or produces lower composition fidelity than the composition-only checkpoint.
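A test of this kind could be scored roughly as follows. This is a sketch under stated assumptions: the `CaseResult` fields, the function names, and the `max_swap_rate` tolerance are hypothetical, the identity judgments are assumed to come from a human or a scoring model, and the numbers are made-up toy values rather than results from the paper.

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    target_id: str      # identity the prompt asked for
    rendered_id: str    # identity judged to appear in the output
    composition: float  # composition fidelity score in [0, 1]

def swap_rate(results):
    """Fraction of cases where the wrong candidate was rendered."""
    return sum(r.rendered_id != r.target_id for r in results) / len(results)

def mean_composition(results):
    """Average composition fidelity over all cases."""
    return sum(r.composition for r in results) / len(results)

def claim_refuted(stage1, stage2, max_swap_rate=0.0):
    """True if the stage-2 checkpoint still swaps identities, or composes
    worse than the composition-only (stage-1) checkpoint."""
    swaps = swap_rate(stage2) > max_swap_rate
    worse_composition = mean_composition(stage2) < mean_composition(stage1)
    return swaps or worse_composition

# Toy data: two prompts with visually similar candidates "cat_a" and "cat_b".
stage1 = [CaseResult("cat_a", "cat_b", 0.8), CaseResult("cat_a", "cat_a", 0.8)]
stage2 = [CaseResult("cat_a", "cat_a", 0.8), CaseResult("cat_a", "cat_a", 0.8)]
print(swap_rate(stage1), swap_rate(stage2), claim_refuted(stage1, stage2))
```

Setting `max_swap_rate` above zero would allow for judge noise; the strict default mirrors the wording of the test above.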

Figures

Figures reproduced from arXiv: 2512.12675 by Bohan Zeng, Chengzhuo Tong, Hao Liang, Wentao Zhang, Wenxuan Liu, Xiaochen Ma, Yang Shi, Yuanxing Zhang, Yuran Wang.

**Figure 2.** Our motivation. (a) visualizes the early similarity between image token hidden states from the understanding and generation experts and text token hidden states within the unified model, showing that the former attends to semantic regions while the latter is less sensitive. (b) illustrates the collaboration between the understanding and generation experts within the unified model through end-to-end trainin…

**Figure 3.** Understanding bridge strategy. Step 1: Understand

**Figure 4.** Overview of our SconeEval benchmark. “Char”: character, “Obj”: object, “Sce”: scene. SconeEval evaluates target subject identification and generation in complex visual contexts. It provides 409 test cases across three domains with 19 case types and 6 subtasks, covering composition, distinction, and distinction & composition tasks.

**Figure 5.** Multi-candidate editing in our SconeEval bench

**Figure 6.** Qualitative comparison of existing models on OmniContext [

**Figure 7.** Qualitative comparison of existing models on SconeEval benchmark.

**Figure 8.** Stability measured by the standard deviation of

**Figure 9.** Prompt for distinction scoring in SconeEval bench

**Figure 10.** Representative similarity and masked images for each layer group.

**Figure 11.** Limitation of our Scone. Our Scone still exhibits a common limitation found in existing methods: unrealistic interaction.

**Figure 12.** Examples of synthesized data with 3 input images.

**Figure 13.** Examples of synthesized data with 4 input images.

**Figure 14.** Data filtering for refined single-candidate data. (a) Prompt for training data filtering.

**Figure 15.** Multi-candidate single-subject data construction. (a) Prompt for instruction construction.

**Figure 16.** Multi-candidate multi-subject data construction. (a) Prompts for subject replacement.

**Figure 17.** Comparison between two-step decoupling and direct strategies for instruction construction.

**Figure 18.** Prompts for instruction construction in SconeEval benchmark. (a) Prompt for subject identification.

Original abstract

Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to distinguish and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Scone adds a two-stage training scheme and SconeEval benchmark to tackle distinction alongside composition in multi-subject generation, but the mechanism lacks isolating evidence.

The paper's core move is to treat understanding and generation as a single model where the understanding side acts as a semantic bridge to help the generator keep subjects distinct in crowded scenes. They train first for basic composition, then add semantic alignment plus attention masking in stage two, and they release SconeEval to measure both capabilities together.

The open release of model, data, and benchmark is the clearest practical contribution here; anyone working on subject-driven synthesis can actually use the resources without starting from scratch. The reported gains over other open models on the two benchmarks are consistent with the abstract claim, and the unified architecture is a straightforward way to reduce interference between subjects.

The soft spot is exactly the one the stress-test note flags. No component ablations appear to separate the effect of the alignment and masking steps from simply continuing training on additional multi-subject examples. Without those controls or before-and-after interference metrics, it is difficult to confirm that the semantic-bridge role is doing the work rather than extra data volume. The abstract also gives no statistical details or exact metric definitions, which leaves the performance edge plausible but not tightly verified.

This work is aimed at computer-vision researchers building or evaluating subject-driven generators, especially those who need test cases with multiple similar subjects. A reader in that niche will find the benchmark and released assets worth looking at. It deserves a serious referee because the problem is concrete, the resources are public, and the empirical results are at least directionally positive, even if the next round of review would need clearer isolation of the training contributions.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Scone, a unified understanding-generation architecture for subject-driven image generation that jointly addresses composition (integrating multiple subjects) and distinction (correctly identifying and rendering specific subjects amid candidates to minimize interference). The core mechanism positions the understanding expert as a semantic bridge that guides the generation expert. Training proceeds in two stages: an initial composition-focused stage followed by distinction enhancement via semantic alignment and attention-based masking. The authors release a new benchmark SconeEval and report that Scone outperforms existing open-source models on composition and distinction tasks across two benchmarks. Model weights, benchmark, and training data are made publicly available.

Significance. If the performance gains are reproducible and attributable to the proposed mechanisms rather than training volume, the work would meaningfully advance multi-subject generation by explicitly modeling distinction, a previously under-addressed capability needed for realistic scenes. The conceptual framing of understanding as a semantic bridge and the public release of code, benchmark, and data constitute clear strengths that support reproducibility and follow-on research.

major comments (2)

§4 (Experiments) and abstract: Performance claims state that Scone outperforms open-source models, yet no details are supplied on experimental controls, statistical significance, variance across runs, or precise metric values and baseline implementations. This absence prevents verification of the central empirical result.
Training section (likely §3.2): The claim that stage-2 semantic alignment plus attention-based masking specifically improves distinction while preserving composition rests on the two-stage scheme. No component-wise ablations, before/after interference metrics (e.g., subject-swap error rates), or controls for additional data volume are reported, leaving open the possibility that observed gains arise simply from continued training on multi-subject examples rather than the proposed mechanisms.
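The first major comment asks for statistical significance and variance across runs. A minimal sketch of what that reporting could look like is a paired t statistic over matched scores (for example, the same seeds or the same test cases evaluated for two models). The function name `paired_t` and the score lists are hypothetical, and the numbers are toy values, not results from the paper.

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched score lists."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Toy scores for a model and a baseline on three matched runs.
model_scores = [4.0, 6.0, 8.0]
baseline_scores = [3.0, 4.0, 5.0]
t_stat, dof = paired_t(model_scores, baseline_scores)
run_spread = statistics.stdev(model_scores)  # variance across runs, as a std
print(t_stat, dof, run_spread)
```

Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` computes both in one call. With only a handful of runs the test has little power, so reporting the raw per-run values alongside it is the safer choice.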

minor comments (2)

Abstract: The statement 'outperforms ... on two benchmarks' should explicitly name the second benchmark in addition to SconeEval for immediate clarity.
Notation and terminology (throughout): Ensure consistent capitalization and definition of 'understanding expert' and 'generation expert' on first use and in all figure captions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We appreciate the referee's emphasis on reproducibility and the need to isolate the contributions of our proposed mechanisms. We will revise the manuscript to provide the requested details and ablations.

Referee: §4 (Experiments) and abstract: Performance claims state that Scone outperforms open-source models, yet no details are supplied on experimental controls, statistical significance, variance across runs, or precise metric values and baseline implementations. This absence prevents verification of the central empirical result.

Authors: We agree that the current presentation lacks sufficient detail for independent verification. In the revised manuscript, we will expand §4 and the abstract to report: exact numerical metric values with standard deviations across three random seeds, baseline implementation details (including code references and hyperparameter settings), statistical significance via paired t-tests, and full experimental controls such as evaluation protocols and data splits. These additions will directly address the verification concern.

Revision: yes

Referee: Training section (likely §3.2): The claim that stage-2 semantic alignment plus attention-based masking specifically improves distinction while preserving composition rests on the two-stage scheme. No component-wise ablations, before/after interference metrics (e.g., subject-swap error rates), or controls for additional data volume are reported, leaving open the possibility that observed gains arise simply from continued training on multi-subject examples rather than the proposed mechanisms.

Authors: We acknowledge that the manuscript does not currently include the requested ablations or controls. In the revision, we will add a dedicated ablation subsection in §3.2 and §4 showing: (i) component-wise results with and without semantic alignment and attention-based masking, (ii) before/after subject-swap error rates and other distinction-specific interference metrics, and (iii) a control experiment training for equivalent total steps on the same multi-subject data but omitting the distinction-specific losses. These results will demonstrate that the observed gains are attributable to the proposed mechanisms rather than data volume or continued training alone.

Revision: yes
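The ablation plan in this response amounts to a small grid of stage-2 runs that differ only in which distinction components are switched on. The sketch below enumerates such a grid; the config keys, the function name `ablation_grid`, and the step count are hypothetical placeholders, not settings from the paper.

```python
from itertools import product

def ablation_grid(total_steps):
    """Enumerate stage-2 variants that share data and step budget.

    The run with both components off is the control: same steps, same
    multi-subject data, no distinction-specific losses.
    """
    runs = []
    for align, mask in product([False, True], repeat=2):
        runs.append({
            "semantic_alignment": align,
            "attention_masking": mask,
            "total_steps": total_steps,
            "is_control": not (align or mask),
        })
    return runs

runs = ablation_grid(total_steps=10_000)  # step count is an arbitrary example
for run in runs:
    print(run)
```

Holding `total_steps` fixed across all four runs is what separates the effect of the mechanisms from the effect of simply training longer.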

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on benchmarks

The paper proposes a two-stage training scheme (composition learning followed by semantic alignment and attention-based masking) for a unified model and supports its claims via experimental results on SconeEval and other benchmarks. No equations, derivations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The semantic-bridge role is asserted through the training procedure and performance comparisons rather than reducing to its own inputs by construction. This is a standard empirical ML paper with independent experimental validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond standard deep learning assumptions; the 'understanding expert' and 'generation expert' are presented as architectural components rather than new physical entities.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
cs.CV · 2026-05 · unverdicted · novelty 7.0
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
cs.CV · 2026-05 · unverdicted · novelty 7.0
UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
cs.CV · 2026-05 · unverdicted · novelty 6.0
A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
cs.CV · 2026-05 · unverdicted · novelty 6.0
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
cs.CV · 2026-04 · unverdicted · novelty 5.0
A reinforcement learning approach adapts general generative models to produce synthetic data that boosts identity recognition accuracy and generalization under privacy constraints.

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
cs.CV · 2026-04 · unverdicted · novelty 4.0
OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
