A Unified and Controllable Framework for Layered Image Generation with Visual Effects
Pith reviewed 2026-05-16 11:51 UTC · model grok-4.3
The pith
LASAGNA generates a photorealistic background and RGBA foreground with integrated effects in one forward pass.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LASAGNA generates a photorealistic background and an RGBA foreground with compelling visual effects in a single forward pass. By treating object-associated visual effects as part of the foreground layer, it supports edits via alpha compositing alone, eliminating identity drift from cascade pipelines. It handles diverse inputs in a unified architecture and introduces a new dataset and benchmark for layered generation.
What carries the argument
The LASAGNA unified neural architecture that synthesizes background and foreground layers with integrated visual effects together.
If this is right
- Enables translation, scaling, recoloring, and duplication of objects using only alpha compositing.
- Eliminates the need for post-edit harmonization models.
- Outperforms prior methods in generation quality across multiple modes.
- Supports a wide range of post-edits without model re-inference.
Where Pith is reading between the lines
- This layered approach could simplify interactive image editing tools by reducing computation for each edit.
- The released dataset of 48K triplets may enable training of specialized models for effect prediction.
- Extending the single-pass design to video sequences could produce consistent layered videos.
Load-bearing premise
A single unified model can generate both photorealistic backgrounds and foregrounds with integrated visual effects reliably enough that alpha compositing preserves identity without further harmonization.
What would settle it
Compositing the generated foreground onto a new background and observing whether the object identity remains unchanged with no additional processing or model calls.
Figures
read the original abstract
Recent image generation models produce impressive composites, but often fail to preserve the identity of user-provided content when editing specific elements: the surrounding scene may shift, and even the edited object's appearance can drift from the original. Layered representation offer a natural remedy--they allow users to independently manipulate individual elements--but existing layered methods typically produce transparent foregrounds without realistic visual effects such as shadows and reflections, forcing the use of a second harmonization model after every edit, which in turn introduces drift. To overcome these limitations, we present LASAGNA, which generates a photorealistic background (BG) and an RGBA foreground with compelling visual effects in a single forward pass. By treating object-associated visual effects as part of the foreground (FG) layer, LASAGNA supports the dominant class of consumer edits (e.g., translation, scaling, recoloring, duplication) via alpha compositing alone, without invoking any model post-edit, thereby eliminating identity drift introduced by cascade editing pipelines. This single-pass design contrasts with prior layered methods that rely on separate expert models for each task. LASAGNA handles diverse conditional inputs--text prompts, FG, BG, and location masks--within a unified architecture. We further release two community resources: LASAGNA-48K, the first public dataset of 48K layered image triplets with photorealistic visual effects, and LASAGNA-BENCH, the first standardized benchmark for layer-centric generation and editing, comprising 242 expert-annotated samples across six diverse sources. Experiments show that LASAGNA outperforms both general-purpose editors and prior layered methods across three generation modes, and supports a wide range of post-edits without any model re-inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LASAGNA, a unified neural model that generates a photorealistic background layer and an RGBA foreground layer (with integrated visual effects such as shadows and reflections) in a single forward pass. By baking effects into the foreground, the approach enables common consumer edits (translation, scaling, recoloring, duplication) via standard alpha compositing without any additional model inference or harmonization step, thereby avoiding identity drift common in cascaded pipelines. The manuscript also releases the LASAGNA-48K dataset (48K layered triplets) and LASAGNA-BENCH (242 expert-annotated samples across six sources), and reports that LASAGNA outperforms both general-purpose editors and prior layered methods across three generation modes while supporting post-edits without re-inference.
Significance. If the single-pass integration of effects and the quantitative superiority hold, the work would meaningfully advance controllable layered image generation by simplifying editing workflows and reducing drift. The public release of a large-scale dataset with photorealistic effects and a standardized benchmark would provide lasting value to the community, enabling reproducible comparisons on layer-centric tasks.
major comments (3)
- [§3] §3 (Method): The unified conditioning mechanism for handling text prompts, FG, BG, and location masks in one architecture is described at a high level but lacks explicit equations or diagrams showing how the inputs are fused without compromising BG photorealism or FG effect consistency; this is load-bearing for the single-pass claim.
- [§4] §4 (Experiments): The abstract states outperformance across three generation modes and support for post-edits without re-inference, yet no specific metrics (e.g., identity preservation scores, FID, or user-study results), tables, or ablation studies comparing against cascade baselines are referenced; without these the central claim that alpha compositing alone suffices cannot be verified.
- [§5] §5 (Dataset and Benchmark): The construction of LASAGNA-48K and LASAGNA-BENCH is presented as a contribution, but details on how visual effects were rendered and validated for photorealism, plus the exact annotation protocol for the 242 samples, are needed to assess whether the resources truly enable fair evaluation of effect integration.
minor comments (2)
- [Abstract] Abstract: 'Layered representation offer' should read 'Layered representations offer'.
- [Notation] Notation: The distinction between BG and FG layers is clear, but the paper should explicitly define the RGBA representation and alpha-compositing formula used for edits to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help clarify the presentation of our contributions. We address each major point below and will revise the manuscript to incorporate the requested details while preserving the core claims.
read point-by-point responses
-
Referee: [§3] §3 (Method): The unified conditioning mechanism for handling text prompts, FG, BG, and location masks in one architecture is described at a high level but lacks explicit equations or diagrams showing how the inputs are fused without compromising BG photorealism or FG effect consistency; this is load-bearing for the single-pass claim.
Authors: We agree that the current description is high-level. In the revision we will add explicit equations describing the multi-modal conditioning fusion (including the cross-attention formulation that integrates text, FG, BG, and mask tokens) and a new figure that diagrams the input embedding and fusion pathway. These additions will explicitly show how background photorealism is preserved via dedicated BG conditioning paths while effects are baked into the FG output. revision: yes
-
Referee: [§4] §4 (Experiments): The abstract states outperformance across three generation modes and support for post-edits without re-inference, yet no specific metrics (e.g., identity preservation scores, FID, or user-study results), tables, or ablation studies comparing against cascade baselines are referenced; without these the central claim that alpha compositing alone suffices cannot be verified.
Authors: Section 4 already contains tables reporting FID, identity preservation (via LPIPS and CLIP similarity), and user-study preference rates across the three generation modes, together with direct comparisons to cascade baselines. We will add explicit forward references from the abstract and introduction to these tables and include one additional ablation that isolates the drift introduced by post-edit harmonization versus our single-pass approach. revision: partial
-
Referee: [§5] §5 (Dataset and Benchmark): The construction of LASAGNA-48K and LASAGNA-BENCH is presented as a contribution, but details on how visual effects were rendered and validated for photorealism, plus the exact annotation protocol for the 242 samples, are needed to assess whether the resources truly enable fair evaluation of effect integration.
Authors: We will expand Section 5 with a dedicated subsection detailing the rendering pipeline (including lighting simulation parameters for shadows and reflections), the photorealism validation procedure (expert visual inspection plus quantitative metrics), and the precise annotation protocol for LASAGNA-BENCH (including annotator guidelines, number of annotators, and inter-annotator agreement statistics). revision: yes
Circularity Check
No significant circularity detected in the framework or claims
full rationale
The paper presents LASAGNA as a new unified neural architecture that produces photorealistic BG and RGBA FG-with-effects in one forward pass, supported by the release of LASAGNA-48K dataset and LASAGNA-BENCH benchmark. No equations, derivations, or load-bearing self-citations appear in the provided text that reduce the central claims to fitted inputs, self-definitions, or prior author results by construction. The approach is justified by architectural choices, new data resources, and experimental outperformance against baselines, remaining self-contained without any reduction of predictions to the inputs themselves.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 2 Pith papers
-
LiWi: Layering in the Wild
LiWi uses an agent-driven data synthesis pipeline to build the LiWi-100k dataset and a model with shadow-guided and degradation-restoration objectives that achieves SoTA performance on RGB L1 and Alpha IoU for natural...
-
LiWi: Layering in the Wild
Introduces LiWi-100k dataset via agent-orchestrated synthesis and a decomposition model with shadow-guided learning and boundary correction that claims state-of-the-art RGB L1 and Alpha IoU on natural images.
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Improving image genera- tion with better captions.OpenAI blog, 2023
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jian- feng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image genera- tion with better captions.OpenAI blog, 2023. 2
work page 2023
-
[3]
Instructpix2pix: Learning to follow image editing in- structions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing in- structions. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 18392–18402, 2023. 1
work page 2023
-
[4]
Pixart-sigma: Weak-to- strong training of diffusion transformer for 4k text-to- image generation
Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-sigma: Weak-to- strong training of diffusion transformer for 4k text-to- image generation. InECCV, 2024. 2
work page 2024
-
[5]
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Chengyue Wu, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus- pro: Unified multimodal understanding and genera- tion with data and model scaling.arXiv preprint arXiv:2501.17811, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Unireal: Universal image generation and editing via learning real-world dynamics
Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, et al. Unireal: Universal image generation and editing via learning real-world dynamics. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12501–12511,
-
[7]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding perfor- mance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024. 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Yusuf Dalva, Yijun Li, Qing Liu, Nanxuan Zhao, Jianming Zhang, Zhe Lin, and Pinar Yanardag. Layerfusion: Harmonized multi-layer text-to-image generation with generative priors.arXiv preprint arXiv:2412.04460, 2024. 2, 3, 5
-
[9]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging proper- ties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InICML, 2024. 2
work page 2024
-
[11]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024. 1, 2, 3
work page 2024
-
[12]
Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, and Sarah Parisot. Generating compositional scenes via text-to-image rgba instance generation.Advances in Neural Information Process- ing Systems, 37:43864–43893, 2024. 3
work page 2024
-
[13]
Multi-scale and detail-enhanced segment anything model for salient object detection
Shixuan Gao, Pingping Zhang, Tianyu Yan, and Huchuan Lu. Multi-scale and detail-enhanced segment anything model for salient object detection. InPro- ceedings of the 32nd ACM International Conference on Multimedia, pages 9894–9903, 2024. 4
work page 2024
-
[14]
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi- granularity comprehension and generation.arXiv preprint arxiv:2404.14396, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neu- ral Information Processing Systems, 36:52132–52152,
-
[16]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information process- ing systems, 30, 2017. 5
work page 2017
-
[17]
Dingbang Huang, Wenbo Li, Yifei Zhao, Xinyu Pan, Yanhong Zeng, and Bo Dai. Psdiffusion: Harmo- nized multi-layer image generation via layout and ap- pearance alignment.arXiv preprint arXiv:2505.11468,
-
[18]
Flux family models.https:// huggingface.co/docs/diffusers/main/ en/api/pipelines/flux, 2025
Hugging Face. Flux family models.https:// huggingface.co/docs/diffusers/main/ en/api/pipelines/flux, 2025. 5, 6
work page 2025
-
[19]
Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffu- sion
Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, and Qiang Xu. Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffu- sion. InEuropean Conference on Computer Vision, pages 150–168. Springer, 2024. 3
work page 2024
-
[20]
Kyoungkook Kang, Gyujin Sim, Geonung Kim, Donguk Kim, Seungho Nam, and Sunghyun Cho. Lay- eringdiff: Layered image synthesis via generation, then disassembly with generative knowledge.arXiv preprint arXiv:2501.01197, 2025. 3
-
[21]
Bingxin Ke, Kevin Qu, Tianfu Wang, Nando Metzger, Shengyu Huang, Bo Li, Anton Obukhov, and Kon- rad Schindler. Marigold: Affordable adaptation of diffusion-based image generators for image analysis. arXiv preprint arXiv:2505.09358, 2025. 4
-
[22]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 4015–4026, 2023. 3
work page 2023
- [23]
-
[24]
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, 9 Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation.arXiv preprint arXiv:2506.03147, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
Microsoft coco: Common ob- jects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common ob- jects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014. 4, 5
work page 2014
-
[27]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Max- imilian Nickel, and Matt Le. Flow matching for gen- erative modeling.arXiv preprint arXiv:2210.02747,
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
World Model on Million-Length Video And Language With Blockwise RingAttention
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention.arXiv preprint arxiv:2402.08268, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Step1X-Edit: A Practical Framework for General Image Editing
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Hong- hao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761, 2025. 2, 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning- chain guided segmentation via cognitive reinforce- ment.arXiv preprint arXiv:2503.06520, 2025. 7
work page internal anchor Pith review arXiv 2025
-
[31]
OpenAI. Gpt image 1.https://platform. openai.com/docs/models/gpt- image- 1,
-
[32]
Accessed: 2025-11-13. 5, 6
work page 2025
-
[33]
Transfer between Modalities with MetaQueries
Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei- Xu, Ji Hou, and Saining Xie. Transfer be- tween modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
On aliased resizing and surprising subtleties in gan evalu- ation
Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evalu- ation. InCVPR, 2022. 5
work page 2022
-
[35]
Scalable diffu- sion models with transformers
William Peebles and Saining Xie. Scalable diffu- sion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 4195–4205, 2023. 3
work page 2023
-
[36]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffu- sion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 1, 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
Sdxl: Improving latent diffusion models for high-resolution image synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InICLR,
-
[38]
Tokenflow: Unified image tokenizer for multimodal understanding and generation
Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation.arXiv preprint arXiv:2412.03069, 2024. 2
-
[39]
Colin Raffel, Noam Shazeer, Adam Roberts, Kather- ine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Jour- nal of machine learning research, 21(140):1–67, 2020. 3
work page 2020
-
[40]
Zero-shot text-to-image generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. Pmlr, 2021. 1, 2, 3
work page 2021
-
[41]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text- conditional image generation with clip latents.arXiv preprint arxiv:2204.06125, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[42]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Rong- hang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[43]
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kun- chang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
High- resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 2, 3
work page 2022
-
[45]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
Mulan: A multi layer annotated dataset for controllable text-to-image generation
Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio Iacobacci, and Sarah Parisot. Mulan: A multi layer annotated dataset for controllable text-to-image generation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 22413–22422, 2024. 2, 3, 4, 5
work page 2024
-
[47]
Unsplash: Free high-resolution photos
Unsplash. Unsplash: Free high-resolution photos. https://unsplash.com/. 5
-
[48]
Illume: Illuminating your llms to see, draw, and self-enhance.arXiv preprint arXiv:2412.06673, 2024a
Chunwei Wang, Guansong Lu, Junwei Yang, Runhui Huang, Jianhua Han, Lu Hou, Wei Zhang, and Hang Xu. Illume: Illuminating your llms to see, draw, and self-enhance.arXiv preprint arXiv:2412.06673, 2024. 2
-
[49]
Tianyu Wang, Xiaowei Hu, Pheng-Ann Heng, and Chi- Wing Fu. Instance shadow detection with a single- stage detector.IEEE transactions on pattern analysis and machine intelligence, 45(3):3259–3273, 2022. 4, 5
work page 2022
-
[50]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arxiv:2409.18869, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[51]
Runpu Wei, Zijin Yin, Shuo Zhang, Lanxiang Zhou, Xueyi Wang, Chao Ban, Tianwei Cao, Hao Sun, Zhongjiang He, Kongming Liang, et al. Omnieraser: Remove objects and their effects in images with paired 10 video-frame data.arXiv preprint arXiv:2501.07397,
-
[52]
Object- drop: Bootstrapping counterfactuals for photorealistic object removal and insertion
Daniel Winter, Matan Cohen, Shlomi Fruchter, Yael Pritch, Alex Rav-Acha, and Yedid Hoshen. Object- drop: Bootstrapping counterfactuals for photorealistic object removal and insertion. InEuropean Conference on Computer Vision, pages 112–129. Springer, 2024. 3, 5
work page 2024
-
[53]
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus: Decoupling visual encoding for unified multi- modal understanding and generation.arXiv preprint arXiv:2410.13848, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical re- port.arXiv preprint arXiv:2508.02324, 2025. 1, 2, 3, 5, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
OmniGen2: Towards Instruction-Aligned Multimodal Generation
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify mul- timodal understanding and generation.arXiv preprint arxiv:2408.12528, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
Generative image layer decomposition with visual effects
Jinrui Yang, Qing Liu, Yijun Li, Soo Ye Kim, Daniil Pakhomov, Mengwei Ren, Jianming Zhang, Zhe Lin, Cihang Xie, and Yuyin Zhou. Generative image layer decomposition with visual effects. InProceedings of the Computer Vision and Pattern Recognition Confer- ence, pages 7643–7653, 2025. 3, 4, 5
work page 2025
-
[58]
Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, and Cihang Xie. Complexedit: Cot-like instruction generation for complexity- controllable image editing benchmark.arXiv preprint arXiv:2504.13143, 2025. 5, 1
-
[59]
ImgEdit: A Unified Image Editing Dataset and Benchmark
Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275, 2025. 6, 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
Anyedit: Mastering unified high-quality image editing for any idea
Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Han- wang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. InPro- ceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 26125–26135, 2025. 1
work page 2025
-
[61]
Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neu- ral Information Processing Systems, 36:31428–31449,
-
[62]
Transparent image layer diffusion using latent trans- parency.arXiv preprint arXiv:2402.17113, 2024
Lvmin Zhang and Maneesh Agrawala. Transparent im- age layer diffusion using latent transparency.arXiv preprint arXiv:2402.17113, 2024. 2, 3, 5, 6, 7
-
[63]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF interna- tional conference on computer vision, pages 3836– 3847, 2023. 2
work page 2023
-
[64]
The unreasonable ef- fectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable ef- fectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vi- sion and pattern recognition, pages 586–595, 2018. 5
work page 2018
-
[65]
Xinyang Zhang, Wentian Zhao, Xin Lu, and Jeff Chien. Text2layer: Layered image genera- tion using latent diffusion model.arXiv preprint arXiv:2307.09781, 2023. 3
-
[66]
Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-context edit: Enabling instructional image editing with in-context generation in large scale dif- fusion transformer.arXiv preprint arXiv:2504.20690,
work page internal anchor Pith review arXiv
-
[67]
Ultraedit: Instruction-based fine-grained image editing at scale
Haozhe Zhao, Xiaojian Shawn Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Min- jia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2024. 2, 1
work page 2024
-
[68]
ObjectClear: Complete object removal via object-effect attention,
Jixin Zhao, Shangchen Zhou, Zhouxia Wang, Peiqing Yang, and Chen Change Loy. Objectclear: Com- plete object removal via object-effect attention.arXiv preprint arXiv:2505.22636, 2025. 3
-
[69]
Yichuan Zhong, Yapeng Tian, et al. Tp-blend: Textual- prompt attention pairing for precise object-style blend- ing in diffusion models.Transactions on Machine Learning Research, 2025. 2
work page 2025
-
[70]
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Chunting Zhou, Lili Yu, Arun Babu, Kushal Tiru- mala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse im- ages with one multi-modal model.arXiv preprint arxiv:2408.11039, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[71]
Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. A task is worth one word: Learn- ing with task prompts for high-quality versatile image inpainting. InEuropean Conference on Computer Vi- sion, pages 195–211. Springer, 2024. 3 11 Controllable Layered Image Generation for Real-World Editing Supplementary Material Model Addition Background M...
work page 2024
-
[72]
Comparison on Public Benchmarks To comprehensively evaluate the ability of LASAGNA, we additionally evaluate LASAGNAon two public benchmarks: ImgEdit-Bench [58] and GenEval [15]. In ImgEdit-Bench, the“Addition” and “Background” tasks basically match our generation modes: FG Cond and BG Cond. GenEval matches our Text2All gener- ation mode. The results are ...
-
[73]
11, we provide more results of our model under three different modes (FG Gen, BG Gen, and Text2All)
More Qualitative Results from LASAGNA As shown in Fig.9, Fig.10, and Fig. 11, we provide more results of our model under three different modes (FG Gen, BG Gen, and Text2All). Specifically, in Fig.9 and Fig.10, for the FG Gen and BG Gen modes, we further demonstrate more flexible applications. We can fix the background and generate different foregrounds, o...
-
[74]
12, we provide more samples from LASAGNA-48K
More Samples from LASAGNA-48K As shown in Fig. 12, we provide more samples from LASAGNA-48K
-
[75]
13, we provide more samples from LASAGNABENCH
More Samples from LASAGNABENCH As shown in Fig. 13, we provide more samples from LASAGNABENCH
-
[76]
Detailed Scores of the GPT Score As shown inTable 4in the main manuscript, we pro- vide the “GPT Score”, which is the average score of the instruction-following and identity-preserving met- rics proposed by Complex-Edit [57]. Here, we addi- tionally provide the original score in Table 10. We run the same prompt three times and take the aver- age as the or...
-
[77]
The results show the superiority of our generation paradigm
Details of the Metric in Section 4.4 (Layer Editing with Visual Effects) To illustrate the necessity of our proposed generation paradigm (Layer Editing with Visual Effects), we con- duct quantitative experiments inSection 4.4of the main manuscript. The results show the superiority of our generation paradigm. We compare three editing modes:Instruct Editing...
-
[78]
Based on the context of our question and the value of the parameter, GPT-5 generates an appropriate nat- ural language description, as shown in the template in Fig. 14. In the subsequent Complex Editing task, we combine the two types of instructions accordingly. This helps ensure that all three types of editing per- form the same actions as much as possib...
-
[79]
FG Gen” denotes background-conditioned foreground layer generation, “BG Gen
Training Samples of Data Curator InSection 3.2, LASAGNA-48K Dataset in the main manuscript, we demonstrate the complete data con- struction pipeline. The data curator is a key compo- nent in the data construction pipeline. To obtain high- quality data filtering results, we carefully annotated about 30K samples by humans, as shown in Fig. 15, to train the ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.