pith. sign in

arxiv: 2601.15507 · v2 · submitted 2026-01-21 · 💻 cs.CV

A Unified and Controllable Framework for Layered Image Generation with Visual Effects

Pith reviewed 2026-05-16 11:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords layered image generationvisual effectsalpha compositingimage editingunified modelphotorealistic synthesisdatasetbenchmark
0
0 comments X

The pith

LASAGNA generates a photorealistic background and RGBA foreground with integrated effects in one forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LASAGNA, a framework that generates photorealistic backgrounds and RGBA foregrounds with effects like shadows in a single forward pass. This design treats visual effects as part of the foreground layer. Common consumer edits can then be performed using only alpha compositing without calling the model again. This avoids the identity drift that happens in pipelines with multiple editing steps. The method unifies handling of text prompts, foregrounds, backgrounds, and masks.

Core claim

LASAGNA generates a photorealistic background and an RGBA foreground with compelling visual effects in a single forward pass. By treating object-associated visual effects as part of the foreground layer, it supports edits via alpha compositing alone, eliminating identity drift from cascade pipelines. It handles diverse inputs in a unified architecture and introduces a new dataset and benchmark for layered generation.

What carries the argument

The LASAGNA unified neural architecture that synthesizes background and foreground layers with integrated visual effects together.

If this is right

  • Enables translation, scaling, recoloring, and duplication of objects using only alpha compositing.
  • Eliminates the need for post-edit harmonization models.
  • Outperforms prior methods in generation quality across multiple modes.
  • Supports a wide range of post-edits without model re-inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This layered approach could simplify interactive image editing tools by reducing computation for each edit.
  • The released dataset of 48K triplets may enable training of specialized models for effect prediction.
  • Extending the single-pass design to video sequences could produce consistent layered videos.

Load-bearing premise

A single unified model can generate both photorealistic backgrounds and foregrounds with integrated visual effects reliably enough that alpha compositing preserves identity without further harmonization.

What would settle it

Compositing the generated foreground onto a new background and observing whether the object identity remains unchanged with no additional processing or model calls.

Figures

Figures reproduced from arXiv: 2601.15507 by Cihang Xie, Jinrui Yang, Letian Zhang, Mengwei Ren, Qing Liu, Yijun Li, Yuyin Zhou, Zhe Lin.

Figure 1
Figure 1. Figure 1: Layered generation with LASAGNA. (a) Our framework supports three generation modes: background-conditioned foreground generation, foreground-conditioned background generation, and text-to-all layer generation, which flexibly handle different inputs and jointly synthesize coherent, high-quality composites, backgrounds, and transparent foregrounds with real￾istic visual effects (e.g., shadows and reflections… view at source ↗
Figure 2
Figure 2. Figure 2: Pipeline of LASAGNA framework. We formulate the joint generation of composite images, backgrounds, and foregrounds as a flexible, layer-conditional denoising task. This single framework supports multiple workflows, including FG Gen, BG Gen, and Text2All. We use a unified input representation with learnable embeddings that distinguish different roles of visual latents (noise, BG, FG, and mask) across tasks,… view at source ↗
Figure 3
Figure 3. Figure 3: Data construction pipeline. Starting with ex￾isting datasets, we implement a four-stage data construc￾tion pipeline leveraging off-the-shelf models with a custom￾trained data curator. This process yields a high-quality dataset as the foundation for subsequent model training. scale dataset of over 48K high-quality image triplets. Data sources. We construct our dataset by curating samples from three public s… view at source ↗
Figure 4
Figure 4. Figure 4: Samples of LASAGNA-48K and LASAGNABENCH. Each sample consists of a composite image, a clean back￾ground, and a foreground layer with visual effects, along with corresponding captions for all components [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Layer generation compared with state-of-the-art image generation and editing models. We compare LASAGNA with Flux.1 [18, 24], Qwen-Image-Edit [53], and gpt-image-1[High] [31]. (a) Across three distinct generation tasks, LASAGNA consistently achieves superior inter-layer coherence and consistency. In contrast, competing models often fail to maintain these properties. (b) Moreover, by generating foregrounds … view at source ↗
Figure 6
Figure 6. Figure 6: Layer generation compared with LayerDiffuse [61]. In FG Gen, our model produces object with appropriate size and position, along with realistic shadows consistent with the background. In BG Gen and Text2All, our model produces visually consistent results across all layers. Furthermore, it can generate new foregrounds with corresponding visual effects, enabling flexible and realistic post-editing. ing, we e… view at source ↗
Figure 7
Figure 7. Figure 7: Layer editing with visual effects. We demon￾strate the benefits of explicit layer representations with vi￾sual effects by comparing three paradigms: Instruct Edit￾ing, Layer Editing, and Layer Editing with Visual Effects (LASAGNA). Across recoloring, spatial, and compositional editing tasks, the lack of explicit layer representations makes Instruct Editing prone to unintended changes and less re￾sponsive t… view at source ↗
Figure 8
Figure 8. Figure 8: Diverse creative applications driven by our model. We leverage both Text2All and FG Gen modes to jointly guide the synthesis process, unlocking a broader range of editing possibilities and producing diverse, visually appealing results. 4.5. Ablations We ablate our design choices in [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: More results from LASAGNA under the background-conditioned foreground generation (FG Gen) mode. 3 [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: More results from LASAGNA under the foreground-conditioned background generation (BG Gen) mode. Text-to-all layer generation Comp: A dog wearing hat and outfit sitting on the beach FG: A dog wearing hat and outfit Comp: A close-up of a kiwi fruit on a textured surface, surrounded by other fruits like an orange, peach, and lemon. FG: A fuzzy brown kiwi fruit. Inputs Text prompt BG FG w/ visual effect Compo… view at source ↗
Figure 11
Figure 11. Figure 11: More results from LASAGNA under the text-to-all layer generation (Text2All) mode. 4 [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: More samples of LASAGNA-48K. Each sample consists of a composite image, a clean background, and a fore￾ground layer with visual effects, along with corresponding captions for all components. 5 [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: More samples of LASAGNABENCH. Each sample consists of a composite image, a foreground layer with visual effects, and a clean background along with corresponding captions for all components. Foreground with visual effects anno￾tated by professional annotators. Background are decomposed by expert models or captured by camera. 1.Red: Change the {Object} bright red. 2.Orange: Change the {Object} to vibrant or… view at source ↗
Figure 14
Figure 14. Figure 14: When performing instruct editing, the content within [PITH_FULL_IMAGE:figures/full_fig_p017_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Training Samples of Data Curator. The red dashed box indicates the main problematic area in the bad cases. 7 [PITH_FULL_IMAGE:figures/full_fig_p018_15.png] view at source ↗
read the original abstract

Recent image generation models produce impressive composites, but often fail to preserve the identity of user-provided content when editing specific elements: the surrounding scene may shift, and even the edited object's appearance can drift from the original. Layered representation offer a natural remedy--they allow users to independently manipulate individual elements--but existing layered methods typically produce transparent foregrounds without realistic visual effects such as shadows and reflections, forcing the use of a second harmonization model after every edit, which in turn introduces drift. To overcome these limitations, we present LASAGNA, which generates a photorealistic background (BG) and an RGBA foreground with compelling visual effects in a single forward pass. By treating object-associated visual effects as part of the foreground (FG) layer, LASAGNA supports the dominant class of consumer edits (e.g., translation, scaling, recoloring, duplication) via alpha compositing alone, without invoking any model post-edit, thereby eliminating identity drift introduced by cascade editing pipelines. This single-pass design contrasts with prior layered methods that rely on separate expert models for each task. LASAGNA handles diverse conditional inputs--text prompts, FG, BG, and location masks--within a unified architecture. We further release two community resources: LASAGNA-48K, the first public dataset of 48K layered image triplets with photorealistic visual effects, and LASAGNA-BENCH, the first standardized benchmark for layer-centric generation and editing, comprising 242 expert-annotated samples across six diverse sources. Experiments show that LASAGNA outperforms both general-purpose editors and prior layered methods across three generation modes, and supports a wide range of post-edits without any model re-inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LASAGNA, a unified neural model that generates a photorealistic background layer and an RGBA foreground layer (with integrated visual effects such as shadows and reflections) in a single forward pass. By baking effects into the foreground, the approach enables common consumer edits (translation, scaling, recoloring, duplication) via standard alpha compositing without any additional model inference or harmonization step, thereby avoiding identity drift common in cascaded pipelines. The manuscript also releases the LASAGNA-48K dataset (48K layered triplets) and LASAGNA-BENCH (242 expert-annotated samples across six sources), and reports that LASAGNA outperforms both general-purpose editors and prior layered methods across three generation modes while supporting post-edits without re-inference.

Significance. If the single-pass integration of effects and the quantitative superiority hold, the work would meaningfully advance controllable layered image generation by simplifying editing workflows and reducing drift. The public release of a large-scale dataset with photorealistic effects and a standardized benchmark would provide lasting value to the community, enabling reproducible comparisons on layer-centric tasks.

major comments (3)
  1. [§3] §3 (Method): The unified conditioning mechanism for handling text prompts, FG, BG, and location masks in one architecture is described at a high level but lacks explicit equations or diagrams showing how the inputs are fused without compromising BG photorealism or FG effect consistency; this is load-bearing for the single-pass claim.
  2. [§4] §4 (Experiments): The abstract states outperformance across three generation modes and support for post-edits without re-inference, yet no specific metrics (e.g., identity preservation scores, FID, or user-study results), tables, or ablation studies comparing against cascade baselines are referenced; without these the central claim that alpha compositing alone suffices cannot be verified.
  3. [§5] §5 (Dataset and Benchmark): The construction of LASAGNA-48K and LASAGNA-BENCH is presented as a contribution, but details on how visual effects were rendered and validated for photorealism, plus the exact annotation protocol for the 242 samples, are needed to assess whether the resources truly enable fair evaluation of effect integration.
minor comments (2)
  1. [Abstract] Abstract: 'Layered representation offer' should read 'Layered representations offer'.
  2. [Notation] Notation: The distinction between BG and FG layers is clear, but the paper should explicitly define the RGBA representation and alpha-compositing formula used for edits to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation of our contributions. We address each major point below and will revise the manuscript to incorporate the requested details while preserving the core claims.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The unified conditioning mechanism for handling text prompts, FG, BG, and location masks in one architecture is described at a high level but lacks explicit equations or diagrams showing how the inputs are fused without compromising BG photorealism or FG effect consistency; this is load-bearing for the single-pass claim.

    Authors: We agree that the current description is high-level. In the revision we will add explicit equations describing the multi-modal conditioning fusion (including the cross-attention formulation that integrates text, FG, BG, and mask tokens) and a new figure that diagrams the input embedding and fusion pathway. These additions will explicitly show how background photorealism is preserved via dedicated BG conditioning paths while effects are baked into the FG output. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract states outperformance across three generation modes and support for post-edits without re-inference, yet no specific metrics (e.g., identity preservation scores, FID, or user-study results), tables, or ablation studies comparing against cascade baselines are referenced; without these the central claim that alpha compositing alone suffices cannot be verified.

    Authors: Section 4 already contains tables reporting FID, identity preservation (via LPIPS and CLIP similarity), and user-study preference rates across the three generation modes, together with direct comparisons to cascade baselines. We will add explicit forward references from the abstract and introduction to these tables and include one additional ablation that isolates the drift introduced by post-edit harmonization versus our single-pass approach. revision: partial

  3. Referee: [§5] §5 (Dataset and Benchmark): The construction of LASAGNA-48K and LASAGNA-BENCH is presented as a contribution, but details on how visual effects were rendered and validated for photorealism, plus the exact annotation protocol for the 242 samples, are needed to assess whether the resources truly enable fair evaluation of effect integration.

    Authors: We will expand Section 5 with a dedicated subsection detailing the rendering pipeline (including lighting simulation parameters for shadows and reflections), the photorealism validation procedure (expert visual inspection plus quantitative metrics), and the precise annotation protocol for LASAGNA-BENCH (including annotator guidelines, number of annotators, and inter-annotator agreement statistics). revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in the framework or claims

full rationale

The paper presents LASAGNA as a new unified neural architecture that produces photorealistic BG and RGBA FG-with-effects in one forward pass, supported by the release of LASAGNA-48K dataset and LASAGNA-BENCH benchmark. No equations, derivations, or load-bearing self-citations appear in the provided text that reduce the central claims to fitted inputs, self-definitions, or prior author results by construction. The approach is justified by architectural choices, new data resources, and experimental outperformance against baselines, remaining self-contained without any reduction of predictions to the inputs themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework is presented as a standard neural architecture trained on the newly introduced dataset.

pith-pipeline@v0.9.0 · 5630 in / 1222 out tokens · 59540 ms · 2026-05-16T11:51:31.654153+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LiWi: Layering in the Wild

    cs.CV 2026-05 unverdicted novelty 7.0

    LiWi uses an agent-driven data synthesis pipeline to build the LiWi-100k dataset and a model with shadow-guided and degradation-restoration objectives that achieves SoTA performance on RGB L1 and Alpha IoU for natural...

  2. LiWi: Layering in the Wild

    cs.CV 2026-05 unverdicted novelty 7.0

    Introduces LiWi-100k dataset via agent-orchestrated synthesis and a decomposition model with shadow-guided learning and boundary correction that claims state-of-the-art RGB L1 and Alpha IoU on natural images.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 1 Pith paper · 25 internal anchors

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 4

  2. [2]

    Improving image genera- tion with better captions.OpenAI blog, 2023

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jian- feng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image genera- tion with better captions.OpenAI blog, 2023. 2

  3. [3]

    Instructpix2pix: Learning to follow image editing in- structions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing in- structions. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 18392–18402, 2023. 1

  4. [4]

    Pixart-sigma: Weak-to- strong training of diffusion transformer for 4k text-to- image generation

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-sigma: Weak-to- strong training of diffusion transformer for 4k text-to- image generation. InECCV, 2024. 2

  5. [5]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Chengyue Wu, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus- pro: Unified multimodal understanding and genera- tion with data and model scaling.arXiv preprint arXiv:2501.17811, 2025. 2

  6. [6]

    Unireal: Universal image generation and editing via learning real-world dynamics

    Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, et al. Unireal: Universal image generation and editing via learning real-world dynamics. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12501–12511,

  7. [7]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding perfor- mance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024. 4

  8. [8]

    Layerfusion: Harmonized multi-layer text-to-image generation with generative priors.arXiv preprint arXiv:2412.04460, 2024

    Yusuf Dalva, Yijun Li, Qing Liu, Nanxuan Zhao, Jianming Zhang, Zhe Lin, and Pinar Yanardag. Layerfusion: Harmonized multi-layer text-to-image generation with generative priors.arXiv preprint arXiv:2412.04460, 2024. 2, 3, 5

  9. [9]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging proper- ties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 1, 2

  10. [10]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InICML, 2024. 2

  11. [11]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024. 1, 2, 3

  12. [12]

    Generating compositional scenes via text-to-image rgba instance generation.Advances in Neural Information Process- ing Systems, 37:43864–43893, 2024

    Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, and Sarah Parisot. Generating compositional scenes via text-to-image rgba instance generation.Advances in Neural Information Process- ing Systems, 37:43864–43893, 2024. 3

  13. [13]

    Multi-scale and detail-enhanced segment anything model for salient object detection

    Shixuan Gao, Pingping Zhang, Tianyu Yan, and Huchuan Lu. Multi-scale and detail-enhanced segment anything model for salient object detection. InPro- ceedings of the 32nd ACM International Conference on Multimedia, pages 9894–9903, 2024. 4

  14. [14]

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi- granularity comprehension and generation.arXiv preprint arxiv:2404.14396, 2024. 2

  15. [15]

    Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neu- ral Information Processing Systems, 36:52132–52152,

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neu- ral Information Processing Systems, 36:52132–52152,

  16. [16]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information process- ing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information process- ing systems, 30, 2017. 5

  17. [17]

    Psdiffusion: Harmo- nized multi-layer image generation via layout and ap- pearance alignment.arXiv preprint arXiv:2505.11468,

    Dingbang Huang, Wenbo Li, Yifei Zhao, Xinyu Pan, Yanhong Zeng, and Bo Dai. Psdiffusion: Harmo- nized multi-layer image generation via layout and ap- pearance alignment.arXiv preprint arXiv:2505.11468,

  18. [18]

    Flux family models.https:// huggingface.co/docs/diffusers/main/ en/api/pipelines/flux, 2025

    Hugging Face. Flux family models.https:// huggingface.co/docs/diffusers/main/ en/api/pipelines/flux, 2025. 5, 6

  19. [19]

    Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffu- sion

    Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, and Qiang Xu. Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffu- sion. InEuropean Conference on Computer Vision, pages 150–168. Springer, 2024. 3

  20. [20]

    Layeringdiff: Layered image synthesis via generation, then disassembly with generative knowledge.arXiv preprint arXiv:2501.01197, 2025

    Kyoungkook Kang, Gyujin Sim, Geonung Kim, Donguk Kim, Seungho Nam, and Sunghyun Cho. Lay- eringdiff: Layered image synthesis via generation, then disassembly with generative knowledge.arXiv preprint arXiv:2501.01197, 2025. 3

  21. [21]

    Marigold: Affordable adaptation of diffusion- based image generators for image analysis.arXiv preprint arXiv:2505.09358, 2025

    Bingxin Ke, Kevin Qu, Tianfu Wang, Nando Metzger, Shengyu Huang, Bo Li, Anton Obukhov, and Kon- rad Schindler. Marigold: Affordable adaptation of diffusion-based image generators for image analysis. arXiv preprint arXiv:2505.09358, 2025. 4

  22. [22]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 4015–4026, 2023. 3

  23. [23]

    Flux, 2024

    Black Forest Labs. Flux, 2024. 2

  24. [24]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, 9 Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742,

  25. [25]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation.arXiv preprint arXiv:2506.03147, 2025. 1

  26. [26]

    Microsoft coco: Common ob- jects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common ob- jects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014. 4, 5

  27. [27]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Max- imilian Nickel, and Matt Le. Flow matching for gen- erative modeling.arXiv preprint arXiv:2210.02747,

  28. [28]

    World Model on Million-Length Video And Language With Blockwise RingAttention

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention.arXiv preprint arxiv:2402.08268, 2024. 2

  29. [29]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Hong- hao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761, 2025. 2, 1

  30. [30]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning- chain guided segmentation via cognitive reinforce- ment.arXiv preprint arXiv:2503.06520, 2025. 7

  31. [31]

    Gpt image 1.https://platform

    OpenAI. Gpt image 1.https://platform. openai.com/docs/models/gpt- image- 1,

  32. [32]

    Accessed: 2025-11-13. 5, 6

  33. [33]

    Transfer between Modalities with MetaQueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei- Xu, Ji Hou, and Saining Xie. Transfer be- tween modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025. 2

  34. [34]

    On aliased resizing and surprising subtleties in gan evalu- ation

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evalu- ation. InCVPR, 2022. 5

  35. [35]

    Scalable diffu- sion models with transformers

    William Peebles and Saining Xie. Scalable diffu- sion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 4195–4205, 2023. 3

  36. [36]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffu- sion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 1, 2, 3

  37. [37]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InICLR,

  38. [38]

    Tokenflow: Unified image tokenizer for multimodal understanding and generation

    Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation.arXiv preprint arXiv:2412.03069, 2024. 2

  39. [39]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Jour- nal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Kather- ine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Jour- nal of machine learning research, 21(140):1–67, 2020. 3

  40. [40]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. Pmlr, 2021. 1, 2, 3

  41. [41]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text- conditional image generation with clip latents.arXiv preprint arxiv:2204.06125, 2022. 2

  42. [42]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Rong- hang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 3

  43. [43]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kun- chang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159, 2024. 3

  44. [44]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 2, 3

  45. [45]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 2

  46. [46]

    Mulan: A multi layer annotated dataset for controllable text-to-image generation

    Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio Iacobacci, and Sarah Parisot. Mulan: A multi layer annotated dataset for controllable text-to-image generation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 22413–22422, 2024. 2, 3, 4, 5

  47. [47]

    Unsplash: Free high-resolution photos

    Unsplash. Unsplash: Free high-resolution photos. https://unsplash.com/. 5

  48. [48]

    Illume: Illuminating your llms to see, draw, and self-enhance.arXiv preprint arXiv:2412.06673, 2024a

    Chunwei Wang, Guansong Lu, Junwei Yang, Runhui Huang, Jianhua Han, Lu Hou, Wei Zhang, and Hang Xu. Illume: Illuminating your llms to see, draw, and self-enhance.arXiv preprint arXiv:2412.06673, 2024. 2

  49. [49]

    Instance shadow detection with a single- stage detector.IEEE transactions on pattern analysis and machine intelligence, 45(3):3259–3273, 2022

    Tianyu Wang, Xiaowei Hu, Pheng-Ann Heng, and Chi- Wing Fu. Instance shadow detection with a single- stage detector.IEEE transactions on pattern analysis and machine intelligence, 45(3):3259–3273, 2022. 4, 5

  50. [50]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arxiv:2409.18869, 2024. 2

  51. [51]

    Omnieraser: Remove objects and their effects in images with paired 10 video-frame data.arXiv preprint arXiv:2501.07397,

    Runpu Wei, Zijin Yin, Shuo Zhang, Lanxiang Zhou, Xueyi Wang, Chao Ban, Tianwei Cao, Hao Sun, Zhongjiang He, Kongming Liang, et al. Omnieraser: Remove objects and their effects in images with paired 10 video-frame data.arXiv preprint arXiv:2501.07397,

  52. [52]

    Object- drop: Bootstrapping counterfactuals for photorealistic object removal and insertion

    Daniel Winter, Matan Cohen, Shlomi Fruchter, Yael Pritch, Alex Rav-Acha, and Yedid Hoshen. Object- drop: Bootstrapping counterfactuals for photorealistic object removal and insertion. InEuropean Conference on Computer Vision, pages 112–129. Springer, 2024. 3, 5

  53. [53]

    Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus: Decoupling visual encoding for unified multi- modal understanding and generation.arXiv preprint arXiv:2410.13848, 2024. 2

  54. [54]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical re- port.arXiv preprint arXiv:2508.02324, 2025. 1, 2, 3, 5, 6, 7

  55. [55]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025. 1

  56. [56]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify mul- timodal understanding and generation.arXiv preprint arxiv:2408.12528, 2024. 2

  57. [57]

    Generative image layer decomposition with visual effects

    Jinrui Yang, Qing Liu, Yijun Li, Soo Ye Kim, Daniil Pakhomov, Mengwei Ren, Jianming Zhang, Zhe Lin, Cihang Xie, and Yuyin Zhou. Generative image layer decomposition with visual effects. InProceedings of the Computer Vision and Pattern Recognition Confer- ence, pages 7643–7653, 2025. 3, 4, 5

  58. [58]

    Complex- edit: Cot-like instruction generation for complexity-controllable image editing benchmark.arXiv preprint arXiv:2504.13143,

    Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, and Cihang Xie. Complexedit: Cot-like instruction generation for complexity- controllable image editing benchmark.arXiv preprint arXiv:2504.13143, 2025. 5, 1

  59. [59]

    ImgEdit: A Unified Image Editing Dataset and Benchmark

    Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275, 2025. 6, 1

  60. [60]

    Anyedit: Mastering unified high-quality image editing for any idea

    Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Han- wang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. InPro- ceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 26125–26135, 2025. 1

  61. [61]

    Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neu- ral Information Processing Systems, 36:31428–31449,

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neu- ral Information Processing Systems, 36:31428–31449,

  62. [62]

    Transparent image layer diffusion using latent trans- parency.arXiv preprint arXiv:2402.17113, 2024

    Lvmin Zhang and Maneesh Agrawala. Transparent im- age layer diffusion using latent transparency.arXiv preprint arXiv:2402.17113, 2024. 2, 3, 5, 6, 7

  63. [63]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF interna- tional conference on computer vision, pages 3836– 3847, 2023. 2

  64. [64]

    The unreasonable ef- fectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable ef- fectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vi- sion and pattern recognition, pages 586–595, 2018. 5

  65. [65]

    Text2layer: Layered image generation using latent diffusion model.arXiv preprint arXiv:2307.09781, 2023

    Xinyang Zhang, Wentian Zhao, Xin Lu, and Jeff Chien. Text2layer: Layered image genera- tion using latent diffusion model.arXiv preprint arXiv:2307.09781, 2023. 3

  66. [66]

    In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

    Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. In-context edit: Enabling instructional image editing with in-context generation in large scale dif- fusion transformer.arXiv preprint arXiv:2504.20690,

  67. [67]

    Ultraedit: Instruction-based fine-grained image editing at scale

    Haozhe Zhao, Xiaojian Shawn Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Min- jia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2024. 2, 1

  68. [68]

    ObjectClear: Complete object removal via object-effect attention,

    Jixin Zhao, Shangchen Zhou, Zhouxia Wang, Peiqing Yang, and Chen Change Loy. Objectclear: Com- plete object removal via object-effect attention.arXiv preprint arXiv:2505.22636, 2025. 3

  69. [69]

    Tp-blend: Textual- prompt attention pairing for precise object-style blend- ing in diffusion models.Transactions on Machine Learning Research, 2025

    Yichuan Zhong, Yapeng Tian, et al. Tp-blend: Textual- prompt attention pairing for precise object-style blend- ing in diffusion models.Transactions on Machine Learning Research, 2025. 2

  70. [70]

    Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    Chunting Zhou, Lili Yu, Arun Babu, Kushal Tiru- mala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse im- ages with one multi-modal model.arXiv preprint arxiv:2408.11039, 2024. 2

  71. [71]

    Addition

    Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, and Kai Chen. A task is worth one word: Learn- ing with task prompts for high-quality versatile image inpainting. InEuropean Conference on Computer Vi- sion, pages 195–211. Springer, 2024. 3 11 Controllable Layered Image Generation for Real-World Editing Supplementary Material Model Addition Background M...

  72. [72]

    Addition

    Comparison on Public Benchmarks To comprehensively evaluate the ability of LASAGNA, we additionally evaluate LASAGNAon two public benchmarks: ImgEdit-Bench [58] and GenEval [15]. In ImgEdit-Bench, the“Addition” and “Background” tasks basically match our generation modes: FG Cond and BG Cond. GenEval matches our Text2All gener- ation mode. The results are ...

  73. [73]

    11, we provide more results of our model under three different modes (FG Gen, BG Gen, and Text2All)

    More Qualitative Results from LASAGNA As shown in Fig.9, Fig.10, and Fig. 11, we provide more results of our model under three different modes (FG Gen, BG Gen, and Text2All). Specifically, in Fig.9 and Fig.10, for the FG Gen and BG Gen modes, we further demonstrate more flexible applications. We can fix the background and generate different foregrounds, o...

  74. [74]

    12, we provide more samples from LASAGNA-48K

    More Samples from LASAGNA-48K As shown in Fig. 12, we provide more samples from LASAGNA-48K

  75. [75]

    13, we provide more samples from LASAGNABENCH

    More Samples from LASAGNABENCH As shown in Fig. 13, we provide more samples from LASAGNABENCH

  76. [76]

    GPT Score

    Detailed Scores of the GPT Score As shown inTable 4in the main manuscript, we pro- vide the “GPT Score”, which is the average score of the instruction-following and identity-preserving met- rics proposed by Complex-Edit [57]. Here, we addi- tionally provide the original score in Table 10. We run the same prompt three times and take the aver- age as the or...

  77. [77]

    The results show the superiority of our generation paradigm

    Details of the Metric in Section 4.4 (Layer Editing with Visual Effects) To illustrate the necessity of our proposed generation paradigm (Layer Editing with Visual Effects), we con- duct quantitative experiments inSection 4.4of the main manuscript. The results show the superiority of our generation paradigm. We compare three editing modes:Instruct Editing...

  78. [78]

    Based on the context of our question and the value of the parameter, GPT-5 generates an appropriate nat- ural language description, as shown in the template in Fig. 14. In the subsequent Complex Editing task, we combine the two types of instructions accordingly. This helps ensure that all three types of editing per- form the same actions as much as possib...

  79. [79]

    FG Gen” denotes background-conditioned foreground layer generation, “BG Gen

    Training Samples of Data Curator InSection 3.2, LASAGNA-48K Dataset in the main manuscript, we demonstrate the complete data con- struction pipeline. The data curator is a key compo- nent in the data construction pipeline. To obtain high- quality data filtering results, we carefully annotated about 30K samples by humans, as shown in Fig. 15, to train the ...