arxiv: 2512.12675 · v2 · submitted 2025-12-14 · 💻 cs.CV · cs.AI

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Pith reviewed 2026-05-16 22:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords subject-driven image generation · multi-subject composition · subject distinction · unified understanding-generation · semantic bridge · two-stage training · attention-based masking · SconeEval benchmark

The pith

Scone uses an understanding expert as a semantic bridge to let generation models handle both multi-subject composition and correct distinction without interference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Subject-driven image generation has improved at combining several subjects but still fails when prompts contain multiple similar candidates and the model must pick the right one. Scone introduces a single architecture that trains an understanding module and a generation module together so the understanding module can pass semantic signals to the generator. The method first teaches composition, then adds distinction through semantic alignment and attention masking that prevents cross-subject leakage. Experiments on two benchmarks show the resulting model outperforms prior open-source systems at both tasks while preserving subject identity.

Core claim

Scone is a unified understanding-generation model for subject-driven image generation. The understanding expert functions as a semantic bridge that conveys information to the generation expert, enabling it to preserve subject identity while minimizing interference among multiple subjects. Training proceeds in two stages: the first stage learns composition, and the second stage strengthens distinction via semantic alignment and attention-based masking. The approach is evaluated on the new SconeEval benchmark and on existing benchmarks, where it surpasses prior open-source models in both composition and distinction metrics.

What carries the argument

The understanding expert acting as a semantic bridge inside a unified understanding-generation architecture, trained with a two-stage schedule of composition learning followed by semantic alignment and attention-based masking.
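
As a rough illustration of what attention-based masking could look like, the sketch below scores each reference-image token by cosine similarity with a pooled text embedding and masks tokens that fall below a threshold. This is not the authors' implementation: the function names, the mean-pooled text query, the toy vectors, and the 0.5 threshold are all assumptions made for illustration, since this page does not specify how Scone computes its masks.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_mask(image_tokens, text_tokens, threshold=0.5):
    """Return 1 for image tokens to keep and 0 for tokens to mask.

    Hypothetical rule: mean-pool the text tokens into one query vector,
    then keep image tokens whose similarity to it reaches `threshold`.
    """
    dim = len(text_tokens[0])
    query = [sum(t[i] for t in text_tokens) / len(text_tokens) for i in range(dim)]
    return [1 if cosine(tok, query) >= threshold else 0 for tok in image_tokens]

# Toy hidden states: two tokens aligned with the text, one unrelated.
image_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [1.0, 0.2]]
mask = similarity_mask(image_tokens, text_tokens)
print(mask)  # [1, 1, 0]
```

In a real model the mask would presumably be applied inside the attention computation rather than to a Python list; the sketch only shows the selection step.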

If this is right

  • Multi-subject prompts become reliable for realistic scene generation without manual subject isolation.
  • Subject identity is preserved across varying contexts while avoiding leakage from other reference images.
  • The SconeEval benchmark provides a standardized way to measure both composition accuracy and distinction correctness.
  • Open-source models can now be fine-tuned with the same two-stage recipe to close the gap with closed models on complex subject tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The semantic-bridge pattern could transfer to video or 3D generation where temporal or spatial consistency across subjects is required.
  • Attention masking during the distinction stage may generalize to other conditional generation settings that need selective focus on reference signals.
  • Combining the method with larger pretrained understanding models could further reduce the data needed for the distinction stage.

Load-bearing premise

The two-stage training with semantic alignment and attention-based masking can strengthen distinction without lowering composition quality or creating new interference between subjects.

What would settle it

A controlled test set of prompts containing two visually similar subjects where the model after distinction training either swaps identities or produces lower composition fidelity than the composition-only checkpoint.
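
One way such a test could be scored is sketched below: a subject-swap rate computed per checkpoint over the same prompts. The record format, the identity labels, and the judgments are made up for illustration; the page does not describe how identities would be assigned, and composition fidelity would need a separate scorer that is not shown here.

```python
def swap_rate(records):
    """Fraction of test cases where the generated subject is not the target.

    Each record is (target_id, predicted_id), where predicted_id is the
    identity a hypothetical judge assigns to the generated subject.
    """
    if not records:
        raise ValueError("no records")
    swaps = sum(1 for target, predicted in records if predicted != target)
    return swaps / len(records)

# Made-up judgments for the same four prompts under two checkpoints.
composition_only = [("cat_a", "cat_b"), ("cat_a", "cat_a"),
                    ("dog_a", "dog_b"), ("dog_a", "dog_a")]
after_distinction = [("cat_a", "cat_a"), ("cat_a", "cat_a"),
                     ("dog_a", "dog_b"), ("dog_a", "dog_a")]

print(swap_rate(composition_only))   # 0.5
print(swap_rate(after_distinction))  # 0.25
```

Under the load-bearing premise, the second rate should be lower than the first on real data; a higher rate, or a drop in composition scores, would count against it.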

Figures

Figures reproduced from arXiv: 2512.12675 by Bohan Zeng, Chengzhuo Tong, Hao Liang, Wentao Zhang, Wenxuan Liu, Xiaochen Ma, Yang Shi, Yuanxing Zhang, Yuran Wang.

Figure 1: The distinction problem and challenges. (a) Prob…
Figure 2: Our motivation. (a) visualizes the early similarity between image token hidden states from the understanding and generation experts and text token hidden states within the unified model, showing that the former attends to semantic regions while the latter is less sensitive. (b) illustrates the collaboration between the understanding and generation experts within the unified model through end-to-end trainin…
Figure 3: Understanding bridge strategy. Step 1: Understand…
Figure 4: Overview of our SconeEval benchmark. “Char”: character, “Obj”: object, “Sce”: scene. SconeEval evaluates target subject identification and generation in complex visual contexts. It provides 409 test cases across three domains with 19 case types and 6 subtasks, covering composition, distinction, and distinction & composition tasks.
Figure 5: Multi-candidate editing in our SconeEval bench…
Figure 6: Qualitative comparison of existing models on OmniContext […
Figure 7: Qualitative comparison of existing models on SconeEval benchmark.
Figure 8: Stability measured by the standard deviation of…
Figure 9: Prompt for distinction scoring in SconeEval bench…
Figure 10: Representative similarity and masked images for each layer group.
Figure 11: Limitation of our Scone. E. Limitation. Our Scone still exhibits a common limitation found in existing methods: unrealistic interaction. As shown in…
Figure 12: Examples of synthesized data with 3 input images.
Figure 13: Examples of synthesized data with 4 input images.
Figure 14: Data filtering for refined single-candidate data. (a) Prompt for training data filtering.
Figure 15: Multi-candidate single-subject data construction. (a) Prompt for instruction construction.
Figure 16: Multi-candidate multi-subject data construction. (a) Prompts for subject replacement.
Figure 17: Comparison between two-step decoupling and direct strategies for instruction construction.
Figure 18: Prompts for instruction construction in SconeEval benchmark. (a) Prompt for subject identification.
Original abstract

Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to distinguish and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Scone, a unified understanding-generation architecture for subject-driven image generation that jointly addresses composition (integrating multiple subjects) and distinction (correctly identifying and rendering specific subjects amid candidates to minimize interference). The core mechanism positions the understanding expert as a semantic bridge that guides the generation expert. Training proceeds in two stages: an initial composition-focused stage followed by distinction enhancement via semantic alignment and attention-based masking. The authors release a new benchmark SconeEval and report that Scone outperforms existing open-source models on composition and distinction tasks across two benchmarks. Model weights, benchmark, and training data are made publicly available.

Significance. If the performance gains are reproducible and attributable to the proposed mechanisms rather than training volume, the work would meaningfully advance multi-subject generation by explicitly modeling distinction, a previously under-addressed capability needed for realistic scenes. The conceptual framing of understanding as a semantic bridge and the public release of code, benchmark, and data constitute clear strengths that support reproducibility and follow-on research.

major comments (2)
  1. [§4 (Experiments) and abstract] Performance claims state that Scone outperforms open-source models, yet no details are supplied on experimental controls, statistical significance, variance across runs, or precise metric values and baseline implementations. This absence prevents verification of the central empirical result.
  2. [Training section (likely §3.2)] The claim that stage-2 semantic alignment plus attention-based masking specifically improves distinction while preserving composition rests on the two-stage scheme. No component-wise ablations, before/after interference metrics (e.g., subject-swap error rates), or controls for additional data volume are reported, leaving open the possibility that observed gains arise simply from continued training on multi-subject examples rather than the proposed mechanisms.
minor comments (2)
  1. [Abstract] The statement 'outperforms ... on two benchmarks' should explicitly name the second benchmark in addition to SconeEval for immediate clarity.
  2. [Throughout] Notation and terminology: Ensure consistent capitalization and definition of 'understanding expert' and 'generation expert' on first use and in all figure captions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We appreciate the referee's emphasis on reproducibility and the need to isolate the contributions of our proposed mechanisms. We will revise the manuscript to provide the requested details and ablations.

Point-by-point responses
  1. Referee: [§4 (Experiments) and abstract] Performance claims state that Scone outperforms open-source models, yet no details are supplied on experimental controls, statistical significance, variance across runs, or precise metric values and baseline implementations. This absence prevents verification of the central empirical result.

    Authors: We agree that the current presentation lacks sufficient detail for independent verification. In the revised manuscript, we will expand §4 and the abstract to report: exact numerical metric values with standard deviations across three random seeds, baseline implementation details (including code references and hyperparameter settings), statistical significance via paired t-tests, and full experimental controls such as evaluation protocols and data splits. These additions will directly address the verification concern.
    Revision: yes

  2. Referee: [Training section (likely §3.2)] The claim that stage-2 semantic alignment plus attention-based masking specifically improves distinction while preserving composition rests on the two-stage scheme. No component-wise ablations, before/after interference metrics (e.g., subject-swap error rates), or controls for additional data volume are reported, leaving open the possibility that observed gains arise simply from continued training on multi-subject examples rather than the proposed mechanisms.

    Authors: We acknowledge that the manuscript does not currently include the requested ablations or controls. In the revision, we will add a dedicated ablation subsection in §3.2 and §4 showing: (i) component-wise results with and without semantic alignment and attention-based masking, (ii) before/after subject-swap error rates and other distinction-specific interference metrics, and (iii) a control experiment training for equivalent total steps on the same multi-subject data but omitting the distinction-specific losses. These results will demonstrate that the observed gains are attributable to the proposed mechanisms rather than data volume or continued training alone.
    Revision: yes
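
The seed-level reporting promised in the first response (mean, standard deviation, paired t-test) could be computed along the lines of the sketch below. The scores are placeholders, not results from the paper, and `paired_t` is a hypothetical helper written for this example using only the Python standard library.

```python
import math
import statistics

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched scores a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("differences have zero variance")
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Placeholder scores for three seeds; these are NOT results from the paper.
model_scores = [0.80, 0.82, 0.81]
baseline_scores = [0.75, 0.76, 0.78]

t_stat, dof = paired_t(model_scores, baseline_scores)
print(statistics.mean(model_scores), statistics.stdev(model_scores))
print(t_stat, dof)
```

With three seeds the test has only two degrees of freedom, so it has little power; more seeds or per-test-case pairing would give a more informative comparison.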

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on benchmarks

full rationale

The paper proposes a two-stage training scheme (composition learning followed by semantic alignment and attention-based masking) for a unified model and supports its claims via experimental results on SconeEval and other benchmarks. No equations, derivations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The semantic-bridge role is asserted through the training procedure and performance comparisons rather than reducing to its own inputs by construction. This is a standard empirical ML paper with independent experimental validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond standard deep learning assumptions; the 'understanding expert' and 'generation expert' are presented as architectural components rather than new physical entities.

pith-pipeline@v0.9.0 · 5476 in / 1081 out tokens · 31461 ms · 2026-05-16T22:36:39.529349+00:00 · methodology

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

  2. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.

  3. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.

  4. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  5. Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition

    cs.CV 2026-04 unverdicted novelty 5.0

    A reinforcement learning approach adapts general generative models to produce synthetic data that boosts identity recognition accuracy and generalization under privacy constraints.

  6. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
