pith. machine review for the scientific record. sign in

arxiv: 2605.11818 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: unknown

RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition

Authors on Pith no claims yet

Pith reviewed 2026-05-13 07:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords layer decompositiondiffusion modelsocclusion handlingRGBA layersattention mechanismsimage decompositiondataset creationboundary enforcement
0
0 comments X

The pith

A diffusion-based framework decomposes natural images into multiple clean RGBA layers by handling occlusions explicitly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing diffusion approaches for image layer decomposition struggle with occlusion completion, robust disentanglement of layers, and precise boundaries in complex natural scenes, and are limited by scarce high-quality datasets. RevealLayer addresses these by introducing a diffusion model that separates an input RGB image into several RGBA layers. The framework adds region-aware attention to separate hidden from visible content, an occlusion-guided adapter that uses surrounding context to fill overlaps, and a composite loss that sharpens alpha edges while removing leftover artifacts. A new 100K multi-layer dataset built with automated tools plus human checks, plus an accompanying benchmark, support training and testing. If the method works, it would let users extract and edit individual layers from ordinary photographs with reliable recovery of what is hidden behind foreground objects.

Core claim

RevealLayer is a diffusion-based framework that decomposes an RGB image into multiple RGBA layers. It uses a Region-Aware Attention module to disentangle hidden and visible layers, an Occlusion-Guided Adapter to enhance overlapping regions with contextual information, and a composite loss to enforce sharp alpha boundaries and suppress residual artifacts. Training and evaluation rely on the RevealLayer-100K dataset, constructed via automated algorithms and human annotation, together with the RevealLayerBench benchmark for general natural scenes. Experiments show the approach consistently outperforms prior methods in layer decomposition accuracy.

What carries the argument

The RevealLayer diffusion framework, driven by a Region-Aware Attention module for layer disentanglement, an Occlusion-Guided Adapter for contextual occlusion handling, and a composite loss for boundary precision.

If this is right

  • More reliable recovery of content hidden behind foreground objects in everyday photos.
  • Sharper alpha masks and fewer blending artifacts than previous diffusion decomposition methods.
  • A new public benchmark and 100K dataset that future work can use to measure progress on natural-scene layer separation.
  • Direct support for downstream tasks such as layer-aware editing and compositing that require clean foreground/background splits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same occlusion-handling modules could be adapted to video by adding temporal consistency constraints across frames.
  • Clean per-layer outputs would simplify insertion of new objects into scenes without manual masking.
  • The approach may generalize to medical or satellite imagery where layers correspond to depth or tissue planes.

Load-bearing premise

The Region-Aware Attention, Occlusion-Guided Adapter, and composite loss together solve occlusion completion and layer separation in complex natural images without creating new artifacts, and the RevealLayer-100K dataset is representative of real scenes.

What would settle it

A collection of natural photographs containing intricate occlusions where the output layers exhibit visible boundary errors or residual blending when inspected against human-annotated ground truth.

Figures

Figures reproduced from arXiv: 2605.11818 by Binhao Wang, Bo Cheng, Dawei Leng, Liebucha Wu, Qiuyu Ji, Shanyuan Liu, Shihao Zhao, Yuhang Ma, Yuhui Yin.

Figure 1
Figure 1. Figure 1: Qualitative comparison of layered image decomposition in natural scenes. RevealLayer exhibits strong capability in artifact removal, occlusion completion, and content consistency. Most prior approaches tackle this problem via cascaded pipelines that sequentially perform instance segmentation, alpha matting, and image inpainting using specialized mod￾els. Such multi-stage designs are highly sensitive to int… view at source ↗
Figure 2
Figure 2. Figure 2: RevealLayer decomposes an input image into multiple RGBA layers with explicit transparency according to user-specified bounding boxes. Our method demonstrates strong capability in completing overlapping regions, accurately recovering object boundaries, and handling transparent objects, while also maintaining high visual consistency in the visible regions. scenes, motivating the development of more flexible… view at source ↗
Figure 3
Figure 3. Figure 3: The framework of RevealLayer, a controllable layer decomposition architecture based on FLUX. It incorporates Region-Aware Attention (RAA) and an Occlusion-Guided Adapter (OGA) to enhance layer disentanglement and occlusion completion, while alpha and orthogonality losses are employed to suppress boundary blur and residual artifacts. the MM-DiT as its backbone. We formulate the decomposi￾tion task as a vari… view at source ↗
Figure 4
Figure 4. Figure 4: Dataset curation pipeline of RevealLayer-100K and RevealLayerBench. Lorth = X N j=1 |⟨ˆI bg RGB, ˆI f gj RGB⟩Ri − ⟨I bg RGB, If gj RGB⟩Ri | (15) where ⟨·, ·⟩ denotes the pixel-wise cosine similarity. Ri denotes the mask corresponding to the region Bi . ˆI bg RGB, ˆI f gj RGB are obtained via Eq. (12). Total Loss. The final training objective is defined by the following loss function: L = LFM + λαLα + λoLor… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison of Image-to-Multi-RGBA. Qwen-Image-Layered and CLD suffer from artifacts in overlapping regions, missing foreground objects, and poor consistency in visible areas, while RevealLayer demonstrates strong performance in layer disentanglement, occluded content recovery, and accurate object boundary reconstruction [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The category distribution and layer number distribution of our RevealLayer-100k. B.1. Object Removal Conventional object removal models typically necessitate re￾fined masks encompassing object effects (e.g., shadows and reflections) for optimal performance. We conducted supple￾mentary experiments on the OBER-Test benchmark, pro￾viding baseline methods with effect masks as input, while our method utilizes o… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative results of controllable layer decomposition. Our method consistently decomposes the desired layers, regardless of the number or location of selected regions [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of the degree of layer separation. This shows the orthogonality loss values for each denoising step during inference on RevealBench. RevealBench dataset, utilizing the disentanglement metric defined in Eq. (15). As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Distribution of texture complexity for background and foreground regions. The x-axis represents the Log-Variance of the Laplacian operator, higher values indicate sharper images with more high-frequency edge information [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Visual results in OBER-Test. Image with mask PowerPaint SmartEraser RORem Attentive Eraser ObjectClear Ours [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Visual results in RevealLayerBench. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Visual results in AIM500. Image SAM-H SAM2-L SAM3 MAM Ours GT [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Visual results in RefMatte-RW100. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Additional visual results of layer decomposition [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Additional visual results of layer decomposition. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Additional visual results of layer decomposition [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Additional visual results demonstrating the controllability of layer decomposition. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_17.png] view at source ↗
read the original abstract

Recent diffusion-based approaches have made substantial progress in image layer decomposition. However, accurately decomposing complex natural images remains challenging due to difficulties in occlusion completion, robust layer disentanglement, and precise foreground boundaries. Moreover, the scarcity of high-quality multi-layer natural image datasets limits advancement. To address these challenges, we propose RevealLayer, a diffusion-based framework that decomposes an RGB image into multiple RGBA layers, enabling precise layer separation and reliable recovery of occluded content in natural images. RevealLayer incorporates three key components: (1) a Region-Aware Attention module to disentangle hidden and visible layers; (2) an Occlusion-Guided Adapter to leverage contextual information to enhance overlapping regions; and (3) a composite loss to enforce sharp alpha boundaries and suppress residual artifacts. To support training and evaluation, we introduce RevealLayer-100K, a high-quality multi-layer natural image constructed through a collaboration between automated algorithms and human annotation, and further establish RevealLayerBench for benchmarking layer decomposition in general natural scenes. Extensive experiments demonstrate that RevealLayer consistently outperforms existing approaches in layer decomposition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes RevealLayer, a diffusion-based framework for decomposing RGB images into multiple RGBA layers in natural scenes. It features a Region-Aware Attention module for layer disentanglement, an Occlusion-Guided Adapter for contextual enhancement in overlaps, and a composite loss for sharp alpha boundaries and artifact suppression. The authors introduce the RevealLayer-100K dataset constructed via automated and human annotation, along with RevealLayerBench, and report consistent outperformance over existing diffusion-based methods.

Significance. If validated, this work advances the state of the art in image layer decomposition by addressing key challenges in occlusion completion and disentanglement for complex natural images. The new dataset fills a gap in high-quality multi-layer data, which could facilitate further research in computer vision applications such as image editing and augmented reality.

major comments (1)
  1. Experimental Evaluation: The abstract claims consistent outperformance on RevealLayerBench, but the summary provides no details on baselines, error bars, data splits, or statistical significance; the full manuscript must report these explicitly to substantiate the central claim, as the soundness assessment notes this gap.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [—] Experimental Evaluation: The abstract claims consistent outperformance on RevealLayerBench, but the summary provides no details on baselines, error bars, data splits, or statistical significance; the full manuscript must report these explicitly to substantiate the central claim, as the soundness assessment notes this gap.

    Authors: We appreciate the referee highlighting the need for explicit experimental details. The full manuscript (Section 4) already specifies the baselines (prior diffusion-based layer decomposition methods), the RevealLayerBench data splits (70/15/15 train/val/test), and reports quantitative results with error bars (mean ± std over three random seeds) in Tables 2–4. To further strengthen the presentation and address the soundness concern, we will add a short paragraph on statistical significance (paired t-tests with p-values) in the revised manuscript. This constitutes a minor clarification rather than new experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces new architectural components (Region-Aware Attention module, Occlusion-Guided Adapter, composite loss) and a newly constructed RevealLayer-100K dataset to enable layer decomposition. Claims of outperforming prior diffusion-based methods rest on these independent innovations and reported metrics on RevealLayerBench, without any reduction of predictions to fitted inputs, self-definitional equations, or load-bearing self-citations. The derivation chain is self-contained, with no steps matching the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the domain assumption that diffusion models can be extended to multi-layer decomposition with the described modules, and on the ad-hoc construction of the RevealLayer-100K dataset.

axioms (1)
  • domain assumption Diffusion-based models can be adapted for precise layer disentanglement and occlusion completion in natural images
    Invoked as the foundation for the RevealLayer framework in the abstract.
invented entities (2)
  • Region-Aware Attention module no independent evidence
    purpose: To disentangle hidden and visible layers
    New component introduced to address layer separation challenges
  • Occlusion-Guided Adapter no independent evidence
    purpose: To leverage contextual information for overlapping regions
    New component introduced to enhance occlusion handling

pith-pipeline@v0.9.0 · 5517 in / 1270 out tokens · 31144 ms · 2026-05-13T07:43:01.692559+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 9 internal anchors

  1. [1]

    Structure and Interpretation of Computer Programs

    Harold Abelson and Gerald Jay Sussman and Julie Sussman. Structure and Interpretation of Computer Programs. 1985

  2. [2]

    , author=

    Lora: Low-rank adaptation of large language models. , author=. ICLR , volume=

  3. [3]

    Visual Information Extraction with Lixto

    Robert Baumgartner and Georg Gottlob and Sergio Flesca. Visual Information Extraction with Lixto. Proceedings of the 27th International Conference on Very Large Databases. 2001

  4. [4]

    Brachman and James G

    Ronald J. Brachman and James G. Schmolze. An overview of the KL-ONE knowledge representation system. Cognitive Science. 1985

  5. [5]

    Complexity results for nonmonotonic logics

    Georg Gottlob. Complexity results for nonmonotonic logics. Journal of Logic and Computation. 1992

  6. [6]

    Hypertree Decompositions and Tractable Queries

    Georg Gottlob and Nicola Leone and Francesco Scarcello. Hypertree Decompositions and Tractable Queries. Journal of Computer and System Sciences. 2002

  7. [7]

    Levesque

    Hector J. Levesque. Foundations of a functional approach to knowledge representation. Artificial Intelligence. 1984

  8. [8]

    Levesque

    Hector J. Levesque. A logic of implicit and explicit belief. Proceedings of the Fourth National Conference on Artificial Intelligence. 1984

  9. [9]

    On the compilability and expressive power of propositional planning formalisms

    Bernhard Nebel. On the compilability and expressive power of propositional planning formalisms. Journal of Artificial Intelligence Research. 2000

  10. [10]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  11. [11]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Stable video diffusion: Scaling latent video diffusion models to large datasets , author=. arXiv preprint arXiv:2311.15127 , year=

  12. [12]

    Jinrui Yang and Qing Liu and Yijun Li and Soo Ye Kim and Daniil Pakhomov and Mengwei Ren and Jianming Zhang and Zhe Lin and Cihang Xie and Yuyin Zhou , title =

  13. [13]

    CoRR , volume =

    Yusuf Dalva and Yijun Li and Qing Liu and Nanxuan Zhao and Jianming Zhang and Zhe Lin and Pinar Yanardag , title =. CoRR , volume =

  14. [14]

    FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models , journal =

    Vladimir Kulikov and Matan Kleiner and Inbar Huberman. FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models , journal =. 2024 , url =. doi:10.48550/ARXIV.2412.08629 , eprinttype =. 2412.08629 , timestamp =

  15. [15]

    CoRR , volume =

    Kyoungkook Kang and Gyujin Sim and Geonung Kim and Donguk Kim and Seungho Nam and Sunghyun Cho , title =. CoRR , volume =

  16. [16]

    CoRR , volume =

    Junjia Huang and Pengxiang Yan and Jinhang Cai and Jiyang Liu and Zhao Wang and Yitong Wang and Xinglong Wu and Guanbin Li , title =. CoRR , volume =

  17. [17]

    CoRR , volume =

    Dingbang Huang and Wenbo Li and Yifei Zhao and Xinyu Pan and Yanhong Zeng and Bo Dai , title =. CoRR , volume =

  18. [18]

    CoRR , volume =

    Junwen Chen and Heyang Jiang and Yanbin Wang and Keming Wu and Ji Li and Chao Zhang and Keiji Yanai and Dong Chen and Yuhui Yuan , title =. CoRR , volume =

  19. [19]

    LayerD: Decomposing Raster Graphic Designs into Layers , journal =

    Tomoyuki Suzuki and Kang. LayerD: Decomposing Raster Graphic Designs into Layers , journal =. 2025 , url =

  20. [20]

    arXiv preprint arXiv:2511.16249 , year=

    Controllable Layer Decomposition for Reversible Multi-Layer Image Generation , author=. arXiv preprint arXiv:2511.16249 , year=

  21. [21]

    CoRR , volume =

    Yueru Jia and Yuhui Yuan and Aosong Cheng and Chuke Wang and Ji Li and Huizhu Jia and Shanghang Zhang , title =. CoRR , volume =. 2024 , url =

  22. [22]

    OmniAlpha: Aligning Transparency-Aware Generation via Multi-Task Unified Reinforcement Learning

    OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation , author=. arXiv preprint arXiv:2511.20211 , year=

  23. [23]

    Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition , journal =

    Chuanxia Zheng and Duy. Visiting the Invisible: Layer-by-Layer Completed Scene Decomposition , journal =. 2021 , url =. doi:10.1007/S11263-021-01517-0 , timestamp =

  24. [24]

    Omnipsd: Layered psd generation with diffusion transformer.arXiv preprint arXiv:2512.09247, 2025

    OmniPSD: Layered PSD Generation with Diffusion Transformer , author=. arXiv preprint arXiv:2512.09247 , year=

  25. [25]

    Qwen-image-layered: Towards inherent editability via layer decomposition.arXiv preprint arXiv:2512.15603, 2025

    Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition , author=. arXiv preprint arXiv:2512.15603 , year=

  26. [26]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Anydoor: Zero-shot object-level image customization , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  27. [27]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Insert anything: Image insertion via in-context editing in dit , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  28. [28]

    arXiv preprint arXiv:2509.21278 , year=

    Does flux already know how to perform physically plausible image composition? , author=. arXiv preprint arXiv:2509.21278 , year=

  29. [29]

    Computer Vision -

    Junhao Zhuang and Yanhong Zeng and Wenran Liu and Chun Yuan and Kai Chen , title =. Computer Vision -

  30. [30]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Attentive eraser: Unleashing diffusion model’s object removal potential via self-attention redirection guidance , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  31. [31]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Smarteraser: Remove anything from images using masked-region guidance , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  32. [32]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    RORem: Training a Robust Object Remover with Human-in-the-Loop , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  33. [33]

    CVPR , year =

    Precise Object and Effect Removal with Adaptive Target-Aware Attention , author =. CVPR , year =

  34. [34]

    ICCV , pages=

    Omnipaint: Mastering object-oriented editing via disentangled insertion-removal inpainting , author=. ICCV , pages=

  35. [35]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    Segment anything , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  36. [36]

    The Thirteenth International Conference on Learning Representations , year=

    Nikhila Ravi and Valentin Gabeur and Yuan-Ting Hu and Ronghang Hu and Chaitanya Ryali and Tengyu Ma and Haitham Khedr and Roman R. The Thirteenth International Conference on Learning Representations , year=

  37. [37]

    SAM 3: Segment Anything with Concepts

    Sam 3: Segment anything with concepts , author=. arXiv preprint arXiv:2511.16719 , year=

  38. [38]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Sdmatte: Grafting diffusion models for interactive matting , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  39. [39]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Matting anything , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  40. [40]

    1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space , author=

    FLUX. 1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space , author=. arXiv e-prints , pages=

  41. [41]

    Flux.1 [dev] , year =

  42. [42]

    Flux.1-Fill [dev] , year =

  43. [43]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Instructpix2pix: Learning to follow image editing instructions , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  44. [44]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Prompt-to-prompt image editing with cross attention control , author=. arXiv preprint arXiv:2208.01626 , year=

  45. [45]

    arXiv preprint arXiv:2404.09990 , year=

    Hq-edit: A high-quality dataset for instruction-based image editing , author=. arXiv preprint arXiv:2404.09990 , year=

  46. [46]

    Longcat-image technical report

    LongCat-Image Technical Report , author=. arXiv preprint arXiv:2512.07584 , year=

  47. [47]

    European conference on computer vision , pages=

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection , author=. European conference on computer vision , pages=. 2024 , organization=

  48. [48]

    The Thirteenth International Conference on Learning Representations , year=

    Hd-painter: high-resolution and prompt-faithful text-guided image inpainting with diffusion models , author=. The Thirteenth International Conference on Learning Representations , year=

  49. [49]

    Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

    Latentpaint: Image inpainting in latent space with diffusion models , author=. Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

  50. [50]

    arXiv preprint arXiv:2304.06790 , year=

    Inpaint anything: Segment anything meets image inpainting , author=. arXiv preprint arXiv:2304.06790 , year=

  51. [51]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    Adding conditional control to text-to-image diffusion models , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  52. [52]

    European Conference on Computer Vision , pages=

    Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  53. [53]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Sdedit: Guided image synthesis and editing with stochastic differential equations , author=. arXiv preprint arXiv:2108.01073 , year=

  54. [54]

    Text2layer: Layered image generation using latent diffusion model.arXiv preprint arXiv:2307.09781, 2023

    Text2layer: Layered image generation using latent diffusion model , author=. arXiv preprint arXiv:2307.09781 , year=

  55. [55]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Omnigen: Unified image generation , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  56. [56]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    OmniGen2: Exploration to Advanced Multimodal Generation , author=. arXiv preprint arXiv:2506.18871 , year=

  57. [57]

    2025 , eprint=

    Qwen-Image Technical Report , author=. 2025 , eprint=

  58. [58]

    ICCV , year=

    CanvasVAE: Learning to Generate Vector Graphic Documents , author=. ICCV , year=

  59. [59]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Kosmos-2: Grounding multimodal large language models to the world , author=. arXiv preprint arXiv:2306.14824 , year=

  60. [60]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Mulan: A multi layer annotated dataset for controllable text-to-image generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  61. [61]

    Forty-first international conference on machine learning , year=

    Scaling rectified flow transformers for high-resolution image synthesis , author=. Forty-first international conference on machine learning , year=

  62. [62]

    arXiv preprint arXiv:2410.14324 , year=

    Hico: Hierarchical controllable diffusion model for layout-to-image generation , author=. arXiv preprint arXiv:2410.14324 , year=

  63. [63]

    arXiv preprint arXiv:2503.09242 , year=

    NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers , author=. arXiv preprint arXiv:2503.09242 , year=

  64. [64]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Bridge Diffusion Model: Bridge Chinese Text-to-Image Diffusion Model with English Communities , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  65. [65]

    Transparent image layer diffusion using latent trans- parency.arXiv preprint arXiv:2402.17113, 2024

    Transparent image layer diffusion using latent transparency , author=. arXiv preprint arXiv:2402.17113 , year=

  66. [66]

    arXiv preprint arXiv:2507.09308 , year=

    Alphavae: Unified end-to-end RGBA image reconstruction and generation with alpha-aware representation learning , author=. arXiv preprint arXiv:2507.09308 , year=

  67. [67]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Art: Anonymous region transformer for variable multi-layer transparent image generation , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  68. [68]

    ACM SIGGRAPH 2024 Conference Papers , pages=

    Object-level scene deocclusion , author=. ACM SIGGRAPH 2024 Conference Papers , pages=

  69. [69]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    Unsupervised layered image decomposition into object prototypes , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  70. [70]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Self-supervised scene de-occlusion , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  71. [71]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Florence-2: Advancing a unified representation for a variety of vision tasks , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  72. [72]

    Qwen3 Technical Report

    Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

  73. [73]

    Advances in neural information processing systems , volume=

    Laion-5b: An open large-scale dataset for training next generation image-text models , author=. Advances in neural information processing systems , volume=

  74. [74]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Instance-wise occlusion and depth orders in natural scenes , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  75. [75]

    Jingfeng Yao and Xinggang Wang and Shusheng Yang and Baoyuan Wang , title =. Inf. Fusion , volume =

  76. [76]

    CoRR , volume =

    Weiqi Li and Xuanyu Zhang and Shijie Zhao and Yabin Zhang and Junlin Li and Li Zhang and Jian Zhang , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2503.22679 , eprinttype =

  77. [77]

    Deep Automatic Natural Image Matting , booktitle =

    Jizhizi Li and Jing Zhang and Dacheng Tao , editor =. Deep Automatic Natural Image Matting , booktitle =

  78. [78]

    Jizhizi Li and Jing Zhang and Dacheng Tao , title =

  79. [79]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer , author=. arXiv preprint arXiv:2511.22699 , year=