Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition
Pith reviewed 2026-05-17 21:13 UTC · model grok-4.3
The pith
A zero-shot generative model inserts reference objects into stylized backgrounds while preserving their identity and achieving visual harmony.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework demonstrates that disentangling identity, style, and composition representations through a multi-stage training protocol and a specialized masked-attention architecture enables high-fidelity cross-domain object composition in a zero-shot manner. Supported by a prior preservation objective and trained on a 115k sample dataset created via large-scale generation followed by iterative human-in-the-loop filtering, the model performs harmonious insertions without text prompts or per-subject optimization, outperforming prior methods on identity and style metrics as validated by user studies.
What carries the argument
Multi-stage training protocol combined with masked-attention architecture that enforces separation of identity, style, and composition representations during generation.
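A minimal sketch of how group-wise masked attention can keep token groups from reading each other, under assumptions the source does not confirm: the group names (`identity`, `style`, `scene`), the `ALLOWED` pairs, and both helper functions are hypothetical, and the paper's actual mask layout may differ.

```python
import math
import random

# Hypothetical rule set: (query group, key group) pairs that may interact.
# Identity and style tokens only see their own group; scene tokens see all.
ALLOWED = {
    ("identity", "identity"),
    ("style", "style"),
    ("scene", "scene"),
    ("scene", "identity"),
    ("scene", "style"),
}


def group_mask(groups, allowed):
    """Boolean matrix: entry [i][j] is True if query i may attend to key j."""
    return [[(gi, gj) in allowed for gj in groups] for gi in groups]


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention where masked pairs get a score of -inf.

    Every query row must have at least one allowed key, otherwise the
    softmax below is undefined.
    """
    d = len(q[0])
    out, weights = [], []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
            if mask[i][j] else -math.inf
            for j, kj in enumerate(k)
        ]
        top = max(scores)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]  # exp(-inf) is 0.0
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        out.append([sum(w * vj[c] for w, vj in zip(row, v))
                    for c in range(len(v[0]))])
    return out, weights


# Toy input: 2 identity tokens, 2 style tokens, 3 scene tokens, width 4.
rng = random.Random(0)
groups = ["identity"] * 2 + ["style"] * 2 + ["scene"] * 3
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in groups]
out, weights = masked_attention(tokens, tokens, tokens,
                                group_mask(groups, ALLOWED))
```

In this toy layout the identity rows of `weights` carry exactly zero mass on style and scene keys, which is the kind of hard separation the claim relies on; whether the real model uses the same pairs is not stated in the source.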
If this is right
- The model requires no per-subject fine-tuning or text prompts for operation.
- It achieves state-of-the-art results on both quantitative identity and style metrics.
- User studies confirm superior visual quality and harmony in the composed images.
- A new public benchmark is provided for evaluating stylized composition tasks.
- The approach generalizes to diverse unseen reference objects and target styles.
Where Pith is reading between the lines
- If the disentanglement succeeds, the same architecture could be adapted for inserting objects into video frames by processing each frame consistently.
- Designers could use this to quickly prototype product placements in various artistic styles without additional model training.
- Future work might test whether the method scales to more complex scenes involving multiple objects or 3D references.
- The human-in-the-loop data curation process could be applied to other generative tasks to improve dataset quality.
Load-bearing premise
The multi-stage training and masked attention successfully separate identity, style, and composition, leaving no residual entanglement that degrades output quality on unseen examples.
What would settle it
Running the model on the new benchmark and finding that its identity preservation scores fall below those of a per-subject fine-tuned generator, or that users in blind tests prefer outputs from existing methods for overall harmony.
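The blind-preference half of this test can be made concrete with an exact sign test. The sketch below assumes each rater gives a forced choice between the proposed method and one baseline, with ties dropped; the function name and the tallies are hypothetical and are not results from the paper.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """One-sided exact binomial p-value for H0: raters have no preference.

    `wins` counts choices for the method under test, `losses` counts
    choices for the baseline. Returns P(X >= wins) for X ~ Binomial(n, 0.5).
    """
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Hypothetical tallies, for illustration only.
p = sign_test_p_value(wins=70, losses=30)
print(f"one-sided p-value: {p:.2e}")
```

A small p-value here would support the claim of user preference; a large one, or a tally that favors the baseline, would count against it.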
Original abstract
Reference-based object composition involves integrating a foreground reference image with a background scene to produce a harmonious fused image. This task becomes particularly challenging in cross-domain scenarios, where models must balance preserving the reference object's identity against harmonizing it to match stylized environments. This under-explored problem is currently split between practical "blenders" that lack generative fidelity and "generators" that require impractical, per-subject online finetuning. In this work, we introduce Insert In Style, the first zero-shot generative framework that is both practical and high-fidelity. Our core contribution is a unified framework with three key components: (i) a novel multi-stage training protocol that disentangles representations for identity, style, and composition, (ii) a specialized masked-attention architecture that surgically enforces this disentanglement during generation, and (iii) a prior preservation objective that keeps learned identity and style priors intact. By design, this approach mitigates the concept interference typical of unified-attention architectures while ensuring robust generalization across diverse references and styles. Our framework is trained on a new 115k-sample dataset curated with a novel data pipeline. This pipeline couples large-scale generation with a rigorous, iterative human-in-the-loop filtering process to ensure both high-fidelity semantic identity and style coherence. Unlike prior work, our model is truly zero-shot and requires no text prompts. We also introduce a new public benchmark for stylized composition. We demonstrate state-of-the-art performance, significantly outperforming existing methods on both identity and style metrics, a result strongly corroborated by user studies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Insert In Style, a zero-shot generative framework for reference-based object composition in cross-domain scenarios. It proposes a multi-stage training protocol to disentangle identity, style, and composition representations, a specialized masked-attention architecture to enforce this disentanglement, and a prior preservation objective. The model is trained on a new 115k-sample dataset curated via a human-in-the-loop pipeline, requires no text prompts or per-subject fine-tuning, introduces a new public benchmark, and claims state-of-the-art performance on identity and style metrics strongly corroborated by user studies.
Significance. If the central claims hold, this would be a significant contribution to generative computer vision by bridging the gap between practical but low-fidelity composition methods and high-fidelity but impractical per-subject fine-tuning approaches. The zero-shot capability, disentanglement strategy, new dataset, and benchmark could enable broader adoption in creative tools and provide reusable resources for future cross-domain synthesis research.
major comments (3)
- [Abstract] The abstract asserts SOTA results and user-study corroboration but supplies no quantitative tables, baselines, error bars, or ablation details; claims rest on unspecified metrics and a human-filtered dataset whose construction cannot be verified from the provided text. This is load-bearing for the main contribution.
- [Methods] Methods (multi-stage training and masked-attention description): The claim that the multi-stage protocol plus masked-attention successfully disentangles identity, style, and composition without concept interference lacks supporting ablation evidence on unseen cross-domain pairs, which is required to substantiate the zero-shot high-fidelity generalization.
- [Architecture] Architecture subsection: The description does not specify how the mask is constructed at inference time for completely novel styles or how the training stages are scheduled to avoid interference, leaving the practical zero-shot mechanism underspecified.
minor comments (2)
- [Introduction] The distinction between the proposed framework and prior 'blenders' versus 'generators' could be made more precise with explicit comparisons in the introduction.
- Notation for the masked-attention mechanism and prior preservation loss would benefit from clearer definitions or pseudocode.
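One way the requested notation could look, offered as an assumption-laden sketch rather than the paper's own definitions: the symbols $M$, $\lambda$, $\mathcal{L}_{\text{task}}$, and $\mathcal{L}_{\text{prior}}$ are illustrative names introduced here, with $\mathcal{L}_{\text{prior}}$ standing for any regularizer that penalizes drift from previously learned identity and style behaviour.

```latex
\begin{align}
\mathrm{Attn}(Q, K, V; M)
  &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + M\right) V,
  &
M_{ij} &=
  \begin{cases}
    0       & \text{if token } i \text{ may attend to token } j,\\
    -\infty & \text{otherwise,}
  \end{cases}
\\
\mathcal{L}
  &= \mathcal{L}_{\text{task}} + \lambda\, \mathcal{L}_{\text{prior}},
  &
\lambda &\ge 0 .
\end{align}
```

Here $d$ is the key dimension, and setting $\lambda = 0$ recovers training without the prior preservation term.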
Simulated Author's Rebuttal
We thank the referee for the insightful and constructive review of our manuscript. We have addressed each of the major comments in detail below and plan to incorporate the suggested improvements in the revised version.
- Referee: [Abstract] The abstract asserts SOTA results and user-study corroboration but supplies no quantitative tables, baselines, error bars, or ablation details; claims rest on unspecified metrics and a human-filtered dataset whose construction cannot be verified from the provided text. This is load-bearing for the main contribution.
  Authors: We acknowledge the referee's concern regarding the abstract's conciseness. As abstracts have strict length limits, they typically summarize rather than detail quantitative results. In the revised manuscript, we have modified the abstract to specify the key metrics (CLIP similarity for identity and style coherence metrics for harmonization) and to highlight the new benchmark and user studies. We have also added a more detailed account of the dataset curation process in the main text (Section 3), including the human-in-the-loop steps and filtering criteria, to improve verifiability. Full quantitative tables with baselines, error bars, and ablations are provided in Sections 4 and 5. We believe this addresses the load-bearing nature of the claims while maintaining abstract brevity.
  revision: partial
- Referee: [Methods] The claim that the multi-stage protocol plus masked-attention successfully disentangles identity, style, and composition without concept interference lacks supporting ablation evidence on unseen cross-domain pairs, which is required to substantiate the zero-shot high-fidelity generalization.
  Authors: We agree that additional ablation evidence would strengthen the disentanglement claims. In the revised manuscript, we have included new experiments ablating the multi-stage training and masked-attention components, evaluated specifically on unseen cross-domain pairs not encountered during training. These results show that removing either component leads to increased concept interference and degraded performance on identity and style metrics. The ablations are detailed in a new subsection with supporting tables and visualizations.
  revision: yes
- Referee: [Architecture] The description does not specify how the mask is constructed at inference time for completely novel styles or how the training stages are scheduled to avoid interference, leaving the practical zero-shot mechanism underspecified.
  Authors: We thank the referee for pointing out this underspecification. We have expanded the Architecture subsection in the revised manuscript to describe the inference-time mask construction: a segmentation mask is obtained from the reference object using a frozen pre-trained segmenter, which operates independently of the style. Regarding training stage scheduling, we now explicitly outline the sequence: the first stage trains the identity encoder with prior preservation loss, the second stage trains the style components with masked attention while keeping identity fixed, and the third stage fine-tunes the composition with all components. Specific epoch counts and loss coefficients are provided to demonstrate how interference is avoided. A schematic diagram of the staged training has been added for clarity.
  revision: yes
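The three-stage sequence described in the last response can be written down as a small freeze/train schedule. This is a sketch under stated assumptions: the module labels (`identity_encoder`, `style_components`, `composition`) are hypothetical names for the components the rebuttal mentions, and the rebuttal does not say which modules stay frozen in the third stage, so the choice below is a guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # modules updated in this stage
    frozen: frozenset     # modules kept fixed in this stage
    note: str             # what the rebuttal says about the stage


# Module names are hypothetical labels, not identifiers from the paper.
SCHEDULE = (
    Stage("stage_1", frozenset({"identity_encoder"}), frozenset(),
          "identity encoder trained with prior preservation loss"),
    Stage("stage_2", frozenset({"style_components"}),
          frozenset({"identity_encoder"}),
          "style components trained with masked attention, identity fixed"),
    # Assumption: only the composition part is updated in stage 3 while the
    # earlier modules stay fixed; the source does not settle this.
    Stage("stage_3", frozenset({"composition"}),
          frozenset({"identity_encoder", "style_components"}),
          "composition fine-tuned with all components present"),
)


def grad_plan(stage):
    """Map each module to True (receives gradients) or False (frozen)."""
    overlap = stage.trainable & stage.frozen
    if overlap:
        raise ValueError(f"both trainable and frozen: {sorted(overlap)}")
    plan = {module: True for module in stage.trainable}
    plan.update({module: False for module in stage.frozen})
    return dict(sorted(plan.items()))


for stage in SCHEDULE:
    print(stage.name, grad_plan(stage), "-", stage.note)
```

Writing the schedule as data makes the interference argument checkable: a module trained in one stage should show up as frozen in the next, and any stage that violates that is easy to spot.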
Circularity Check
No circularity: claims rest on independent training protocol and architecture without reduction to fitted inputs
The paper presents a multi-stage training protocol, masked-attention mechanism, and prior preservation objective as design choices to achieve disentanglement of identity, style, and composition. No equations, derivations, or fitted parameters are shown that reduce the claimed zero-shot performance or harmony metrics to these inputs by construction. The abstract and described contributions treat the disentanglement as an emergent property of the architecture and data pipeline rather than a tautological renaming or self-referential fit. The framework is evaluated against external benchmarks and user studies, making the central claims self-contained rather than circular.
Axiom & Free-Parameter Ledger
invented entities (1)
- masked-attention architecture: no independent evidence