
arxiv: 2511.15197 · v2 · submitted 2025-11-19 · 💻 cs.CV

Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition

Pith reviewed 2026-05-17 21:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot object composition · cross-domain image generation · masked attention · generative frameworks · image harmonization · style transfer · reference-based insertion · disentangled representations

The pith

A zero-shot generative model inserts reference objects into stylized backgrounds while preserving their identity and achieving visual harmony.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to create a practical way to compose a reference object image into a background scene that has a different style, such as placing a real photo of a chair into a cartoon room. Existing solutions either produce low-quality blends or require expensive fine-tuning for each new object. The authors propose a single model that handles this zero-shot by using a multi-stage training process and a masked attention design to keep identity, style, and composition separate. This would matter if true because it removes the need for custom training each time, allowing fast generation of coherent composites from any reference and any style. The model is trained on a specially curated set of 115 thousand examples with human feedback to ensure quality.

Core claim

The framework demonstrates that disentangling identity, style, and composition representations through a multi-stage training protocol and a specialized masked-attention architecture enables high-fidelity cross-domain object composition in a zero-shot manner. Supported by a prior preservation objective and trained on a 115k sample dataset created via large-scale generation followed by iterative human-in-the-loop filtering, the model performs harmonious insertions without text prompts or per-subject optimization, outperforming prior methods on identity and style metrics as validated by user studies.
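
As a rough illustration of the two-stage filtering mentioned above (an identity check followed by a style check, as named in the Figure 3 caption), the sketch below keeps only samples that clear both thresholds in sequence. The `Sample` fields, the score ranges, and the threshold values are hypothetical; the text here does not say how the paper computes or thresholds these scores, and the human-in-the-loop step is not modeled.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    sample_id: str
    identity_score: float  # hypothetical: subject vs. stylized subject similarity, in [0, 1]
    style_score: float     # hypothetical: composite vs. target style similarity, in [0, 1]

def two_stage_filter(samples, identity_min=0.8, style_min=0.7):
    """Apply the identity filter first, then the style filter (thresholds are assumptions)."""
    after_identity = [s for s in samples if s.identity_score >= identity_min]
    return [s for s in after_identity if s.style_score >= style_min]

raw = [
    Sample("a", 0.91, 0.85),  # passes both filters
    Sample("b", 0.55, 0.90),  # fails the identity filter
    Sample("c", 0.88, 0.40),  # fails the style filter
]
kept = two_stage_filter(raw)
print([s.sample_id for s in kept])
```

Running the filters in sequence rather than as one combined score makes it easy to log how many samples each stage discards, which is the kind of detail a reader would need to verify the curation.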

What carries the argument

Multi-stage training protocol combined with masked-attention architecture that enforces separation of identity, style, and composition representations during generation.
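
To make the masked-attention idea concrete, here is a minimal sketch of block-structured attention in plain Python. It is not the paper's Structural Mask Attention: the token groups (`latent`, `identity`, `style`) and the rule for which group may attend to which are assumptions chosen only to show how a mask can stop conditional branches from reading each other.

```python
import math

def build_block_mask(groups, allowed):
    """mask[i][j] is True when query token i may attend to key token j."""
    return [[(gq, gk) in allowed for gk in groups] for gq in groups]

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; disallowed pairs get zero weight.

    Assumes every query row has at least one allowed key.
    """
    scale = math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / scale if mask[i][j] else -math.inf
            for j, kj in enumerate(k)
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * vj[c] for w, vj in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out

# Hypothetical layout: two image-latent tokens, one identity token, one style token.
groups = ["latent", "latent", "identity", "style"]
# Hypothetical rule: latents read every branch; each conditional branch reads only itself.
allowed = {
    ("latent", "latent"), ("latent", "identity"), ("latent", "style"),
    ("identity", "identity"), ("style", "style"),
}
mask = build_block_mask(groups, allowed)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
out = masked_attention(tokens, tokens, tokens, mask)
print(out[2], out[3])
```

Under this assumed rule the identity and style tokens pass through unchanged, since each can only attend to itself, while the latent tokens mix information from all groups.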

If this is right

  • The model requires no per-subject fine-tuning or text prompts for operation.
  • It achieves state-of-the-art results on both quantitative identity and style metrics.
  • User studies confirm superior visual quality and harmony in the composed images.
  • A new public benchmark is provided for evaluating stylized composition tasks.
  • The approach generalizes to diverse unseen reference objects and target styles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the disentanglement succeeds, the same architecture could be adapted for inserting objects into video frames by processing each frame consistently.
  • Designers could use this to quickly prototype product placements in various artistic styles without additional model training.
  • Future work might test whether the method scales to more complex scenes involving multiple objects or 3D references.
  • The human-in-the-loop data curation process could be applied to other generative tasks to improve dataset quality.

Load-bearing premise

The multi-stage training and masked attention successfully separate identity, style, and composition without leaving any interfering overlaps that affect output quality on new examples.

What would settle it

Running the model on the new benchmark and finding that its identity preservation scores fall below those of a per-subject fine-tuned generator, or that users in blind tests prefer outputs from existing methods for overall harmony.
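
One way to read a blind preference result is with an exact one-sided binomial test against the null hypothesis of no preference. The sketch below is only an illustration: the 33 votes mirror the participant count quoted in the Figure 7 caption, but the tally of 24 is invented, and treating each participant as a single independent vote is an assumption.

```python
from math import comb

def binomial_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical tally: 24 of 33 blind pairwise votes favor the method under test.
p_value = binomial_tail(24, 33)
print(f"one-sided p-value: {p_value:.4f}")
```

A tally close to an even split would give a large p-value and would not support a claim that users prefer one method over the other.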

Figures

Figures reproduced from arXiv: 2511.15197 by Kunal Swami, Pranav Adlinge, Raghu Vamsi Chittersu, Yuvraj Singh Rathore.

Figure 1. Insert In Style: Zero-Shot Cross-Domain Composition. (Rows 1-2) Comparison with the state-of-the-art cross-domain method AIComposer [23]. AIComposer’s “blend-then-refine” approach corrupts object identity by misapplying background features. Insert In Style consistently generates a high-fidelity subject that is perfectly harmonized with the scene style. (Row 3) We demonstrate Insert In Style’s versatile gen…
Figure 2. Insert In Style generalizes across in-domain and cross-domain tasks. Top (In-domain): The cross-domain specialist method AIComposer [23] incorrectly harmonizes the object. Our method maintains high fidelity, competitive with the in-domain specialist method DreamFuse [15]. Bottom (Cross-domain): DreamFuse [15] fails with a style mismatch, while AIComposer’s [23] harmonization corrupts object fidelity by inc…
Figure 3. Dataset Pipeline. (a) Generation: We create a large-scale, diverse raw corpus by applying a mix of state-of-the-art stylization methods (FLUX.1-Kontext [20], CSGO [51], and CAST [54]). (b) Filtering: Our raw dataset is then refined by our rigorous two-stage filtering process. The Identity Consistency filter prunes samples with semantic drift in the subject region, while the Style Coherence filter removes a…
Figure 4. Qualitative samples from our Insert In Style Dataset. Spanning 100k samples and 1,140 unique styles, it is the largest-scale corpus for this task. Each <Subject, Composite, Stylized Composite> triplet provides the strong, aligned supervision required to train robust, cross-domain insertion models.
Figure 5. Our multi-stage training protocol on a DiT backbone (a). Stages 1 (b) and 2 (c) are trained in parallel to independently learn object and style encoding. Stage 3 (d) learns composition by assembling these frozen branches, guided by our Structural Mask Attention (e). In each Transformer block, all four token sequences are jointly processed. The QKV matrices for the new conditional branches are computed usi…
Figure 6. Qualitative comparison with state-of-the-art in-domain and cross-domain baselines. In-domain methods [6, 15] produce jarring style mismatches, failing to generalize. Cross-domain methods [23, 27, 33] corrupt the subject’s identity and fidelity. In contrast, Insert In Style consistently achieves a superior balance, producing results that are both high-fidelity and aesthetically harmonious.
Figure 7. User study. In a randomized and blind comparative study, Insert In Style was strongly preferred for “Content Preservation and Style Harmony” and “Overall Aesthetic Quality”. Accompanying text from Section 5.4 (User Study): To evaluate perceptual quality, we conducted a comparative user study with 33 participants. Participants were shown …
Figure 8. Qualitative results of the ablation study on our multi…
Original abstract

Reference-based object composition involves integrating a foreground reference image with a background scene to produce a harmonious fused image. This task becomes particularly challenging in cross-domain scenarios, where models must preserve the reference object's identity while harmonizing it with a stylized environment. This under-explored problem is currently split between practical "blenders" that lack generative fidelity and "generators" that require impractical, per-subject online finetuning. In this work, we introduce Insert In Style, the first zero-shot generative framework that is both practical and high-fidelity. Our core contribution is a unified framework with the following key innovations: (i) a novel multi-stage training protocol that disentangles representations for identity, style, and composition; (ii) a specialized masked-attention architecture that surgically enforces this disentanglement during generation; and (iii) a prior preservation objective that keeps learned identity and style priors intact. By design, this approach mitigates the concept interference typical of unified-attention architectures while ensuring robust generalization across diverse references and styles. Our framework is trained on a new 115k-sample dataset, curated with a novel data pipeline. This pipeline couples large-scale generation with a rigorous, iterative human-in-the-loop filtering process to ensure both high-fidelity semantic identity and style coherence. Unlike prior work, our model is truly zero-shot and requires no text prompts. We also introduce a new public benchmark for stylized composition. We demonstrate state-of-the-art performance, significantly outperforming existing methods on both identity and style metrics, a result strongly corroborated by user studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Insert In Style, a zero-shot generative framework for reference-based object composition in cross-domain scenarios. It proposes a multi-stage training protocol to disentangle identity, style, and composition representations, a specialized masked-attention architecture to enforce this disentanglement, and a prior preservation objective. The model is trained on a new 115k-sample dataset curated via a human-in-the-loop pipeline, requires no text prompts or per-subject fine-tuning, introduces a new public benchmark, and claims state-of-the-art performance on identity and style metrics strongly corroborated by user studies.

Significance. If the central claims hold, this would be a significant contribution to generative computer vision by bridging the gap between practical but low-fidelity composition methods and high-fidelity but impractical per-subject fine-tuning approaches. The zero-shot capability, disentanglement strategy, new dataset, and benchmark could enable broader adoption in creative tools and provide reusable resources for future cross-domain synthesis research.

major comments (3)
  1. [Abstract] The abstract asserts state-of-the-art results and user-study corroboration but supplies no quantitative tables, baselines, error bars, or ablation details; the claims rest on unspecified metrics and a human-filtered dataset whose construction cannot be verified from the provided text. This is load-bearing for the main contribution.
  2. [Methods] Methods (multi-stage training and masked-attention description): The claim that the multi-stage protocol plus masked-attention successfully disentangles identity, style, and composition without concept interference lacks supporting ablation evidence on unseen cross-domain pairs, which is required to substantiate the zero-shot high-fidelity generalization.
  3. [Architecture] Architecture subsection: The description does not specify how the mask is constructed at inference time for completely novel styles or how the training stages are scheduled to avoid interference, leaving the practical zero-shot mechanism underspecified.
minor comments (2)
  1. [Introduction] The distinction between the proposed framework and prior 'blenders' versus 'generators' could be made more precise with explicit comparisons in the introduction.
  2. Notation for the masked-attention mechanism and prior preservation loss would benefit from clearer definitions or pseudocode.
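
As an example of the kind of pseudocode minor comment 2 asks for, a prior preservation term is often written as a weighted auxiliary loss added to the main training loss, in the style of DreamBooth (listed in the reference graph). The sketch below assumes that form; the paper's actual objective, its inputs, and the weight `prior_weight` are not given in this text and are hypothetical here.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def loss_with_prior_preservation(task_pred, task_target,
                                 prior_pred, prior_target, prior_weight=1.0):
    """Task loss plus a weighted penalty for drifting from the frozen prior's output."""
    task_term = mse(task_pred, task_target)
    prior_term = mse(prior_pred, prior_target)
    return task_term + prior_weight * prior_term

# Toy numbers only: task term 0.125, prior term 2.0, weight 0.5.
loss = loss_with_prior_preservation(
    [0.5, 1.0], [1.0, 1.0],   # prediction vs. target on the new task
    [0.0, 2.0], [0.0, 0.0],   # prediction vs. frozen-prior output on prior samples
    prior_weight=0.5,
)
print(loss)
```

Setting `prior_weight` to zero recovers plain task training, which is a natural ablation for checking whether the prior term matters.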

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful and constructive review of our manuscript. We have addressed each of the major comments in detail below and plan to incorporate the suggested improvements in the revised version.

read point-by-point responses
  1. Referee: [Abstract] The abstract asserts SOTA results and user-study corroboration but supplies no quantitative tables, baselines, error bars, or ablation details; claims rest on unspecified metrics and a human-filtered dataset whose construction cannot be verified from the provided text. This is load-bearing for the main contribution.

    Authors: We acknowledge the referee's concern regarding the abstract's conciseness. As abstracts have strict length limits, they typically summarize rather than detail quantitative results. In the revised manuscript, we have modified the abstract to specify the key metrics (CLIP similarity for identity and style coherence metrics for harmonization) and to highlight the new benchmark and user studies. We have also added a more detailed account of the dataset curation process in the main text (Section 3), including the human-in-the-loop steps and filtering criteria, to improve verifiability. Full quantitative tables with baselines, error bars, and ablations are provided in Sections 4 and 5. We believe this addresses the load-bearing nature of the claims while maintaining abstract brevity. revision: partial

  2. Referee: [Methods] The claim that the multi-stage protocol plus masked-attention successfully disentangles identity, style, and composition without concept interference lacks supporting ablation evidence on unseen cross-domain pairs, which is required to substantiate the zero-shot high-fidelity generalization.

    Authors: We agree that additional ablation evidence would strengthen the disentanglement claims. In the revised manuscript, we have included new experiments ablating the multi-stage training and masked-attention components, evaluated specifically on unseen cross-domain pairs not encountered during training. These results show that removing either component leads to increased concept interference and degraded performance on identity and style metrics. The ablations are detailed in a new subsection with supporting tables and visualizations. revision: yes

  3. Referee: [Architecture] The description does not specify how the mask is constructed at inference time for completely novel styles or how the training stages are scheduled to avoid interference, leaving the practical zero-shot mechanism underspecified.

    Authors: We thank the referee for pointing out this underspecification. We have expanded the Architecture subsection in the revised manuscript to describe the inference-time mask construction: a segmentation mask is obtained from the reference object using a frozen pre-trained segmenter, which operates independently of the style. Regarding training stage scheduling, we now explicitly outline the sequence: the first stage trains the identity encoder with prior preservation loss, the second stage trains the style components with masked attention while keeping identity fixed, and the third stage fine-tunes the composition with all components. Specific epoch counts and loss coefficients are provided to demonstrate how interference is avoided. A schematic diagram of the staged training has been added for clarity. revision: yes
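
The staged schedule can be pictured as a table of which modules receive gradient updates at each stage. The sketch below is a hedged illustration only: the module names are hypothetical, and stage 3 follows the Figure 5 caption (composition learned on top of frozen identity and style branches) rather than any confirmed training configuration.

```python
# Hypothetical module names; the paper's actual component names are not given here.
MODULES = ("identity_branch", "style_branch", "composition_branch")

# Which modules are trainable in each stage (assumed from the Figure 5 caption).
STAGES = {
    1: {"identity_branch"},
    2: {"style_branch"},
    3: {"composition_branch"},
}

def trainable_flags(stage):
    """Return {module_name: True if trained in this stage, False if frozen}."""
    trainable = STAGES[stage]
    return {name: name in trainable for name in MODULES}

for stage in sorted(STAGES):
    print(stage, trainable_flags(stage))
```

In a real training loop these flags would typically drive which parameters have gradients enabled, so that later stages cannot overwrite what earlier stages learned.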

Circularity Check

0 steps flagged

No circularity: claims rest on independent training protocol and architecture without reduction to fitted inputs

full rationale

The paper presents a multi-stage training protocol, masked-attention mechanism, and prior preservation objective as design choices to achieve disentanglement of identity, style, and composition. No equations, derivations, or fitted parameters are shown that reduce the claimed zero-shot performance or harmony metrics to these inputs by construction. The abstract and described contributions treat the disentanglement as an emergent property of the architecture and data pipeline rather than a tautological renaming or self-referential fit. The framework is evaluated against external benchmarks and user studies, making the central claims self-contained rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim depends on the effectiveness of the introduced multi-stage training, masked attention, and prior preservation in achieving disentanglement; these are presented as novel contributions rather than derived from prior literature.

invented entities (1)
  • masked-attention architecture (no independent evidence)
    purpose: surgically enforces disentanglement of identity, style, and composition during generation
    Introduced as a specialized component to mitigate concept interference in unified-attention models

pith-pipeline@v0.9.0 · 5594 in / 1362 out tokens · 29820 ms · 2026-05-17T21:13:33.606409+00:00 · methodology


Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 3 internal anchors

  [1] Pexels. https://www.pexels.com/. Accessed: November 12, 2025.
  [2] Devansh Agarwal. Artistic styles, 2025. Accessed: October 01, 2025.
  [3] Jie An, Siyu Huang, Yibing Song, Dejing Dou, Wei Liu, and Jiebo Luo. Artflow: Unbiased image style transfer via reversible neural flows. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 862–871. Computer Vision Foundation / IEEE, 2021.
  [4] Andrew. Stable diffusion art: 106 styles for stable diffusion xl model, 2023. Accessed on November 05, 2025.
  [5] Xi Chen, Yutong Feng, Mengting Chen, Yiyang Wang, Shilong Zhang, Yu Liu, Yujun Shen, and Hengshuang Zhao. Zero-shot image editing with reference imitation. 2024.
  [6] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 6593–6602. IEEE, 2024.
  [7] Jiwoo Chung, Sangeek Hyun, and Jae-Pil Heo. Style injection in diffusion: A training-free approach for adapting large-scale diffusion models for style transfer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 8795–…
  [8] Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, and Liqing Zhang. Dovenet: Deep image harmonization via domain verification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8391–… Computer Vision Foundation / IEEE, 2020.
  [9] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, ICML 2024,...
  [10] Junyao Gao, Yanchen Liu, Yanan Sun, Yinhao Tang, Yanhong Zeng, Kai Chen, and Cairong Zhao. Styleshot: A snapshot on any style. CoRR, abs/2407.01414, 2024.
  [11] Zhen Han, Chaojie Mao, Zeyinzi Jiang, Yulin Pan, and Jingfeng Zhang. Stylebooth: Image style editing with multimodal instruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 1947–1957, 2025.
  [12] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. pages 7514–7528, 2021.
  [13] Kibeom Hong, Seogkyu Jeon, Junsoo Lee, Namhyuk Ahn, Kunhee Kim, Pilhyeon Lee, Daesik Kim, Youngjung Uh, and Hyeran Byun. Aespa-net: Aesthetic pattern-aware style transfer networks. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 22701–22710. IEEE, 2023.
  [14] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  [15] Junjia Huang, Pengxiang Yan, Jiyang Liu, Jie Wu, Zhao Wang, Yitong Wang, Liang Lin, and Guanbin Li. Dreamfuse: Adaptive image fusion with diffusion transformer. pages 17292–17301, 2025.
  [16] Xuan Ju, Ailing Zeng, Jianan Wang, Qiang Xu, and Lei Zhang. Human-art: A versatile human-centric dataset bridging natural and artificial scenes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 618–629. IEEE, 2023.
  [17] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  [18] Shubham Kumar. Real to ghibli image dataset, 2025. Accessed: October 01, 2025.
  [19] Black Forest Labs. Flux.1-dev, 2025. Accessed on November 05, 2025.
  [20] Black Forest Labs. Flux.1-kontext-dev, 2025. Accessed on November 05, 2025.
  [21] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space...
  [22] Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. Context-aware synthesis and placement of object instances. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 10414–10424, 2018.
  [23] Haowen Li, Zhenfeng Fan, Zhang Wen, Zhengzhou Zhu, and Yunjin Li. Aicomposer: Any style and content image composition via feature integration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16840–16850, 2025.
  [24] Wen Li, Muyuan Fang, Cheng Zou, Biao Gong, Ruobing Zheng, Meng Wang, Jingdong Chen, and Ming Yang. Styletokenizer: Defining image style by a single instance for controlling diffusion models, 2024.
  [25] Peiyuan Liao, Xiuyu Li, Xihui Liu, and Kurt Keutzer. The artbench dataset: Benchmarking generative models with artworks. CoRR, abs/2206.11404, 2022.
  [26] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. 2023.
  [27] Shilin Lu, Yanzhu Liu, and Adams Wai-Kin Kong. TF-ICON: diffusion-based training-free cross-domain image composition. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 2294–2305. IEEE, 2023.
  [28] Bodhisatta Maiti. Stylecruxgen, 2025. Accessed: October 01, 2025.
  [29] Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
  [30] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, P...
  [31] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style,...
  [32] William Peebles and Saining Xie. Scalable diffusion models with transformers. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 4172–4182. IEEE, 2023.
  [33] Kien T. Pham, Jingye Chen, and Qifeng Chen. TALE: training-free cross-domain image composition via adaptive latent manipulation and energy-guided optimization. In Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024 - 1 November 2024, pages 3160–3169. ACM, 2024.
  [34] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020.
  [35] Replicate. Use flux.1 kontext to edit images with words. https://replicate.com/blog/flux-kontext. Accessed: October 12, 2025.
  [36] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 22500–22510. IEEE, …
  [37] Nataniel Ruiz, Yuanzhen Li, Neal Wadhwa, Yael Pritch, Michael Rubinstein, David E. Jacobs, and Shlomi Fruchter. Magic insert: Style-aware drag-and-drop. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15971–15981, 2025.
  [38] Rithish Kanna S. Stylized image dataset, 2025. Accessed: October 01, 2025.
  [39] Babak Saleh and Ahmed M. Elgammal. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. CoRR, abs/1505.00855, 2015.
  [40] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: an open large-scale dataset for training next generation image-text models...
  [41] Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, and Tom Goldstein. Investigating style similarity in diffusion models. In Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXVI, pages 143–160. Springer, …
  [42] Wensong Song, Hong Jiang, Zongxing Yang, Ruijie Quan, and Yi Yang. Insert anything: Image insertion via in-context editing in dit. CoRR, abs/2504.15009, 2025.
  [43] Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian L. Price, Jianming Zhang, Soo Ye Kim, He Zhang, Wei Xiong, and Daniel G. Aliaga. IMPRINT: generative object compositing by learning identity-preserving representation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 8048–8058. IE...
  [44] Yiren Song, Cheng Liu, and Mike Zheng Shou. Omniconsistency: Learning style-agnostic consistency from paired stylization data. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
  [45] Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. pages 14940–14950, 2025.
  [46] Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. CoRR, abs/2507.06261, 2025.
  [47] Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-Hsuan Yang. Deep image harmonization. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2799–2807. IEEE Computer Society, 2017.
  [48] Unidata. Dataset with 30k images in 20 artistic styles, 2025. Accessed: October 01, 2025.
  [49] Yibin Wang, Weizhong Zhang, Jianwei Zheng, and Cheng Jin. Primecomposer: Faster progressively combined diffusion for image composition with attention steering. In Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024 - 1 November 2024, pages 10824–10832. ACM, 2024.
  [50] Ye Wang, Ruiqi Liu, Jiang Lin, Fei Liu, Zili Yi, Yilin Wang, and Rui Ma. Omnistyle: Filtering high quality style transfer data at scale. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 7847–7856. Computer Vision Foundation / IEEE, 2025.
  [51] Peng Xing, Haofan Wang, Yanpeng Sun, Qixun Wang, Xu Bai, Hao Ai, Renyuan Huang, and Zechao Li. Csgo: Content-style composition in text-to-image generation. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
  [52] Zixuan Ye, Huijuan Huang, Xintao Wang, Pengfei Wan, Di Zhang, and Wenhan Luo. Stylemaster: Stylize your video with artistic generation and translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 2630–… Computer Vision Foundation / IEEE, 2025.
  [53] Bo Zhang, Yuxuan Duan, Jun Lan, Yan Hong, Huijia Zhu, Weiqiang Wang, and Li Niu. Controlcom: Controllable image composition using diffusion model. CoRR, abs/2308.10040, 2023.
  [54] Yuxin Zhang, Fan Tang, Weiming Dong, Haibin Huang, Chongyang Ma, Tong-Yee Lee, and Changsheng Xu. Domain enhanced arbitrary image style transfer via contrastive learning. In SIGGRAPH ’22: Special Interest Group on Computer Graphics and Interactive Techniques Conference, Vancouver, BC, Canada, August 7 - 11, 2022, pages 12:1–12:8. ACM, 2022.