Recognition: 2 theorem links
LiWi: Layering in the Wild
Pith reviewed 2026-05-15 01:58 UTC · model grok-4.3
The pith
Agent-driven synthesis creates over 100,000 layered natural images, and models trained on them decompose natural scenes with state-of-the-art accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present an Agent-driven Data Decomposition pipeline that orchestrates agents and tools to generate a large-scale dataset of more than 100,000 high-quality layered natural images without manual labeling. We then train a decomposition network with shadow-guided objectives that explicitly model illumination effects and a degradation-restoration loss that supplies boundary supervision by reconstructing the clean foreground from a degraded version, yielding state-of-the-art results on RGB L1 and Alpha IoU metrics for natural image decomposition.
What carries the argument
The ADD pipeline for automatic layered-data synthesis combined with shadow-guided learning for illumination effects and degradation-restoration supervision for alpha boundary accuracy.
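To make the two training objectives concrete, here is a minimal PyTorch-style sketch. The abstract gives no equations, so the residual shadow target, the boundary weighting, and the loss weights w_shadow and w_restore are all assumptions for illustration, not the paper's actual losses.

import torch
import torch.nn.functional as F

def shadow_guided_loss(pred_shadow, src, recomposed):
    # Supervise the predicted shadow layer against the residual between the
    # source image and the shadow-free recomposition (S = I_src - I_c, per
    # the definition quoted in the Lean-theorem section below).
    return F.l1_loss(pred_shadow, src - recomposed)

def degradation_restoration_loss(restored_fg, clean_fg, alpha_gt):
    # Weight the restoration error toward alpha boundaries, where the clean
    # foreground must be recovered from its degraded copy.
    boundary = (alpha_gt > 0.05) & (alpha_gt < 0.95)  # soft-transition band
    per_pixel = (restored_fg - clean_fg).abs()
    return (per_pixel * (1.0 + boundary.float())).mean()

def total_loss(pred, gt, w_shadow=0.5, w_restore=0.5):
    # pred: model outputs; gt: ground-truth layers from the synthesized data.
    l_rgb = F.l1_loss(pred["foreground"], gt["foreground"])
    l_shadow = shadow_guided_loss(pred["shadow"], gt["src"], gt["recomposed"])
    l_restore = degradation_restoration_loss(
        pred["restored_fg"], gt["foreground"], gt["alpha"])
    return l_rgb + w_shadow * l_shadow + w_restore * l_restore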
If this is right
- Scalable creation of layered training data becomes possible without human annotation effort.
- Models gain explicit handling of real-world shadow and lighting interactions during decomposition.
- Alpha mattes achieve higher boundary precision on natural images with complex edges.
- Fine-grained editing applications extend from graphic design to ordinary photographs.
- Quantitative metrics for decomposition quality improve consistently across standard benchmarks.
Where Pith is reading between the lines
- Extending the ADD pipeline with temporal constraints could support layered video decomposition.
- The generated dataset may expose systematic lighting biases that current models inherit from graphic-design data.
- Integration with text-to-image generators could enable direct synthesis of layered natural scenes from prompts.
Load-bearing premise
The automated ADD pipeline produces accurate layered ground truth for complex natural scenes that correctly reflects real illumination and object boundaries.
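The abstract does not describe the pipeline's internals, so the sketch below is only one plausible shape for it; the segmenter, inpainter, and validator interfaces are invented for illustration. It makes visible where segmentation or inpainting errors would flow directly into the synthesized ground truth.

def add_pipeline(image, segmenter, inpainter, validator):
    # One hypothetical agent loop: peel off foreground objects one at a
    # time, inpaint the hole each leaves, and let a validator agent reject
    # bad cutouts before they become ground truth.
    layers, remaining = [], image
    while (obj := segmenter.next_object(remaining)) is not None:
        fg, alpha = obj.foreground, obj.alpha
        remaining = inpainter.fill(remaining, alpha)  # hallucinated occluded content
        if validator.accept(fg, alpha, remaining):
            layers.append((fg, alpha))
    layers.append(remaining)  # whatever is left becomes the background layer
    return layers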
What would settle it
A side-by-side test on a fresh set of real photographs with independently hand-annotated layers, showing that the model fails to improve on the best prior RGB L1 or Alpha IoU scores.
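For reference, here is a hedged NumPy sketch of the two metrics such a test would score. The paper's exact protocol (any foreground masking, the binarization threshold) is not stated in the abstract, so those choices are assumptions.

import numpy as np

def rgb_l1(pred_rgb, gt_rgb):
    # Mean absolute error over RGB values in [0, 1].
    return np.abs(pred_rgb - gt_rgb).mean()

def alpha_iou(pred_alpha, gt_alpha, thresh=0.5):
    # Intersection-over-union of the binarized alpha mattes.
    p, g = pred_alpha >= thresh, gt_alpha >= thresh
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union else 1.0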
Original abstract
Recent advances in generative models have empowered impressive layered image generation, yet their success is largely confined to graphic design domains. The layering of in-the-wild images remains an underexplored problem, limiting fine-grained editing and applications of images in real-world scenarios. Specifically, challenges remain in scalable layered data and the modeling of object interaction in natural images, such as illumination effects and structural boundaries. To address these bottlenecks, we propose a novel framework for high-fidelity natural image decomposition. First, we introduce an Agent-driven Data Decomposition (ADD) pipeline that orchestrates agents and tools to synthesize layered data without manual intervention. Utilizing this pipeline, we construct a large-scale dataset, named LiWi-100k, with over 100,000 high-quality layered in-the-wild images. Second, we present a novel framework that jointly improves photometric fidelity and alpha boundary accuracy. Specifically, shadow-guided learning explicitly models the illumination effects, and the degradation-restoration objective provides boundary-correction supervision by recovering the clean foreground image from a degraded one. Extensive experiments demonstrate that our framework achieves state-of-the-art (SoTA) performance in natural image decomposition, outperforming existing models in RGB L1 and Alpha IoU metrics. We will soon release our code and dataset.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the LiWi framework for high-fidelity decomposition of natural images into layers. It proposes an Agent-driven Data Decomposition (ADD) pipeline that orchestrates agents and tools to synthesize layered ground truth without manual intervention, yielding the LiWi-100k dataset of over 100,000 in-the-wild images. The model incorporates shadow-guided learning to explicitly model illumination effects and a degradation-restoration objective to supervise boundary accuracy by recovering clean foregrounds from degraded inputs. Experiments are reported to demonstrate state-of-the-art performance on RGB L1 and Alpha IoU metrics, outperforming prior models.
Significance. If the synthetic ground truth faithfully reproduces real-world illumination, shadow interactions, and boundaries, and if the performance gains generalize beyond the authors' dataset, the work could meaningfully advance layered image decomposition for natural scenes. This addresses a clear gap relative to generative models that succeed mainly in graphic-design domains and could enable more precise fine-grained editing applications.
Major comments (3)
- [ADD Pipeline and Dataset Construction] The headline SoTA claims on RGB L1 and Alpha IoU are measured exclusively on the LiWi-100k dataset produced by the ADD pipeline. No quantitative fidelity evaluation (e.g., agreement with human annotations on the same source photographs) or error-propagation analysis from agent mistakes is reported, which is load-bearing for any claim of superior performance on natural images.
- [Experiments] The abstract asserts SoTA results yet supplies no quantitative tables, baseline comparisons, error bars, or cross-dataset generalization tests. Without these, it is impossible to verify the magnitude or robustness of the reported improvements.
- [Proposed Framework] The shadow-guided learning and degradation-restoration objectives are described at a high level but lack explicit loss formulations, network details, or ablation studies isolating their contribution to photometric fidelity and boundary accuracy.
Minor comments (1)
- [Abstract] The promise to release code and dataset is stated without a timeline, repository link, or licensing information.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: [ADD Pipeline and Dataset Construction] The headline SoTA claims on RGB L1 and Alpha IoU are measured exclusively on the LiWi-100k dataset produced by the ADD pipeline. No quantitative fidelity evaluation (e.g., agreement with human annotations on the same source photographs) or error-propagation analysis from agent mistakes is reported, which is load-bearing for any claim of superior performance on natural images.
Authors: We agree that direct validation of the synthetic ground truth against human annotations is important for supporting claims on natural images. The current version does not include such a quantitative fidelity study or explicit error-propagation analysis. In the revised manuscript we will add a human evaluation on a random subset of 500 source photographs, reporting agreement metrics for layer boundaries, shadows, and overall decomposition quality. We will also include an error-propagation study that injects controlled agent mistakes and measures downstream impact on RGB L1 and Alpha IoU. These additions will appear in a new subsection of the Experiments section. revision: yes
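As an illustration of what such an error-propagation study could look like (everything below is a sketch under our own assumptions, not the authors' protocol), one can inject controlled mask errors and track the metric decay:

import numpy as np
from scipy import ndimage

def perturb_alpha(alpha, radius):
    # Simulate an agent segmentation mistake by dilating (radius > 0)
    # or eroding (radius < 0) the binarized matte.
    binary = alpha >= 0.5
    op = ndimage.binary_dilation if radius > 0 else ndimage.binary_erosion
    return op(binary, iterations=abs(radius)).astype(np.float32)

def iou(a, b):
    p, g = a >= 0.5, b >= 0.5
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union else 1.0

# Toy ground-truth matte: sweep perturbation severity and record IoU decay.
gt_alpha = np.zeros((64, 64), dtype=np.float32)
gt_alpha[16:48, 16:48] = 1.0
for r in (-3, -1, 1, 3):
    print(f"radius {r:+d}: IoU {iou(perturb_alpha(gt_alpha, r), gt_alpha):.3f}")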
Referee: [Experiments] The abstract asserts SoTA results yet supplies no quantitative tables, baseline comparisons, error bars, or cross-dataset generalization tests. Without these, it is impossible to verify the magnitude or robustness of the reported improvements.
Authors: The full manuscript contains tables with baseline comparisons, multiple-run error bars, and statistical significance tests on the LiWi-100k test split. These details were omitted from the abstract for brevity. We will revise the abstract to report the key quantitative gains (e.g., RGB L1 and Alpha IoU deltas versus the strongest baseline). In addition, we will add a cross-dataset generalization experiment on an external natural-image set (e.g., a held-out portion of COCO or Adobe FiveK with manually verified layers) and report the corresponding metrics with error bars. revision: yes
Referee: [Proposed Framework] The shadow-guided learning and degradation-restoration objectives are described at a high level but lack explicit loss formulations, network details, or ablation studies isolating their contribution to photometric fidelity and boundary accuracy.
Authors: We will expand the Method section with the precise loss equations for both the shadow-guided term (including the illumination modeling loss) and the degradation-restoration objective. A new subsection will detail the network architecture, layer dimensions, and training hyperparameters. We will also add ablation tables that isolate each objective’s contribution to photometric fidelity (RGB L1) and boundary accuracy (Alpha IoU), together with qualitative examples showing the effect of removing each component. revision: yes
Circularity Check
No circularity: dataset synthesis and objectives are independent contributions
Full rationale
The paper's core claims rest on an Agent-driven Data Decomposition pipeline that generates the LiWi-100k dataset and a framework using shadow-guided learning plus degradation-restoration objectives. No equations, derivations, or fitted parameters are shown to reduce by construction to the inputs; the SoTA results on RGB L1 and Alpha IoU are empirical measurements on the newly synthesized data rather than tautological predictions. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the derivation. The pipeline and objectives constitute genuine new engineering steps whose validity can be assessed externally via fidelity checks or human annotations, keeping the argument self-contained.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Natural images admit accurate layered decompositions with alpha mattes that capture object interactions, including illumination and boundaries.
Invented entities (2)
- Agent-driven Data Decomposition (ADD) pipeline: no independent evidence
- LiWi-100k dataset: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We propose a novel framework for high-fidelity natural image decomposition... shadow-guided learning explicitly models the illumination effects, and degradation-restoration objective provides boundary-correction supervision"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "The shadow layer S is defined as the residual between the source image Isrc and the recomposed image Ic"
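Read literally, that quoted definition is a pixelwise residual. Reconstructed in LaTeX from the sentence alone (the paper may additionally clamp or mask the difference):

S = I_{\mathrm{src}} - I_{c}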
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Yoni Kasten, Dolev Ofri, Oliver Wang, and Tali Dekel. Layered neural atlases for consistent video editing. ACM Transactions on Graphics, 40(6):1–12, 2021.
- [2] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2LIVE: Text-driven layered image and video editing. In European Conference on Computer Vision, pages 707–723, 2022.
- [3] Yao-Chih Lee, Ji-Ze Genevieve Jang, Yi-Ting Chen, Elizabeth Qiu, and Jia-Bin Huang. Shape-aware text-driven layered video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14317–14326, 2023.
- [4] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with Fourier convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2149–2159, 2022.
- [5] Tomoyuki Suzuki, Kang-Jun Liu, Naoto Inoue, and Kota Yamaguchi. LayerD: Decomposing raster graphic designs into layers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17783–17792, 2025.
- [6] Shengming Yin, Zekai Zhang, Zecheng Tang, Kaiyuan Gao, Xiao Xu, Kun Yan, Jiahao Li, Yilei Chen, Yuxiang Chen, Heung-Yeung Shum, et al. Qwen-Image-Layered: Towards inherent editability via layer decomposition. arXiv preprint arXiv:2512.15603, 2025.
- [7] Cheng Liu, Yiren Song, Haofan Wang, and Mike Zheng Shou. OmniPSD: Layered PSD generation with diffusion transformer. arXiv preprint arXiv:2512.09247, 2025.
- [8] Elena Garces, Carlos Rodriguez-Pardo, Dan Casas, and Jorge Lopez-Moreno. A survey on intrinsic images: Delving deep into Lambert and beyond. International Journal of Computer Vision, 130:836–868, 2022.
- [9] Kota Yamaguchi. CanvasVAE: Learning to generate vector graphic documents. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5481–5489, 2021.
- [10] Jiahui Huang, Jun Gao, Vignesh Ganapathi-Subramanian, Hao Su, Yin Liu, Chengcheng Tang, and Leonidas J Guibas. DeepPrimitive: Image decomposition by layered primitive detection. Computational Visual Media, 4(4):385–397, 2018.
- [11] Jinrui Yang, Qing Liu, Yijun Li, Soo Ye Kim, Daniil Pakhomov, Mengwei Ren, Jianming Zhang, Zhe Lin, Cihang Xie, and Yuyin Zhou. Generative image layer decomposition with visual effects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7643–7653, 2025.
- [12] Xinyang Zhang, Wentian Zhao, Xin Lu, and Jeff Chien. Text2Layer: Layered image generation using latent diffusion model. arXiv preprint arXiv:2307.09781, 2023.
- [13] Runhui Huang, Kaixin Cai, Jianhua Han, Xiaodan Liang, Renjing Pei, Guansong Lu, Songcen Xu, Wei Zhang, and Hang Xu. LayerDiff: Exploring text-guided multi-layered composable image synthesis via layer-collaborative diffusion model. In European Conference on Computer Vision, pages 144–160. Springer, 2024.
- [14] Yifan Pu, Yiming Zhao, Zhicong Tang, Ruihong Yin, Haoxing Ye, Yuhui Yuan, Dong Chen, Jianmin Bao, Sirui Zhang, Yanbin Wang, et al. ART: Anonymous region transformer for variable multi-layer transparent image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7952–7962, 2025.
- [15] Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio Iacobacci, and Sarah Parisot. MuLAn: A multi layer annotated dataset for controllable text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22413–22422, 2024.
- [16] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.
- [17] Jingye Chen, Zhaowen Wang, Nanxuan Zhao, Li Zhang, Difan Liu, Jimei Yang, and Qifeng Chen. Rethinking layered graphic design generation with a top-down approach. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16861–16870, 2025.
- [18] Jinrui Yang, Qing Liu, Yijun Li, Mengwei Ren, Letian Zhang, Zhe Lin, Cihang Xie, and Yuyin Zhou. Controllable layered image generation for real-world editing. arXiv preprint arXiv:2601.15507, 2026.
- [19] Fangyi Chen, Yaojie Shen, Lu Xu, Ye Yuan, Shu Zhang, Yulei Niu, and Longyin Wen. Referring layer decomposition. arXiv preprint arXiv:2602.19358, 2026.
- [20] Lvmin Zhang and Maneesh Agrawala. Transparent image layer diffusion using latent transparency. arXiv preprint arXiv:2402.17113, 2024.
- [21] Junjia Huang, Pengxiang Yan, Jinhang Cai, Jiyang Liu, Zhao Wang, Yitong Wang, Xinglong Wu, and Guanbin Li. DreamLayer: Simultaneous multi-layer generation via diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3357–3366, 2025.
- [22] Dingbang Huang, Wenbo Li, Yifei Zhao, Xinyu Pan, Yanhong Zeng, and Bo Dai. PSDiffusion: Harmonized multi-layer image generation via layout and appearance alignment. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3233–3242, 2026.
- [23] Kyoungkook Kang, Gyujin Sim, Geonung Kim, Donguk Kim, Seungho Nam, and Sunghyun Cho. LayeringDiff: Layered image synthesis via generation, then disassembly with generative knowledge. arXiv preprint arXiv:2501.01197, 2025.
- [24] Yusuf Dalva, Yijun Li, Qing Liu, Nanxuan Zhao, Jianming Zhang, Zhe Lin, and Pinar Yanardag. LayerFusion: Harmonized multi-layer text-to-image generation with generative priors. In NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI, 2025.
- [25] Jingxi Chen, Yixiao Zhang, Xiaoye Qian, Zongxia Li, Cornelia Fermuller, Caren Chen, and Yiannis Aloimonos. From inpainting to layer decomposition: Repurposing generative inpainting models for image layer decomposition. arXiv preprint arXiv:2511.20996, 2025.
- [26] Haoran Sun, Haoyu Bian, Shaoning Zeng, Yunbo Rao, Xu Xu, Lin Mei, and Jianping Gou. DatasetAgent: A novel multi-agent system for auto-constructing datasets from real-world images. arXiv preprint arXiv:2507.08648, 2025.
- [27] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
- [28] Black Forest Labs. Flux.2-klein-9b, 2026. URL https://bfl.ai/models/flux-2-klein. Accessed: 2026-04-27.
- [29] BRIA AI. RMBG-1.4: Background removal model, 2024. URL https://huggingface.co/briaai/RMBG-1.4. Accessed: 2026-04-27.
- [30] Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Jorma Laaksonen, Wanli Ouyang, and Nicu Sebe. Bilateral reference for high-resolution dichotomous image segmentation. CAAI Artificial Intelligence Research, 2024.
- [31] BRIA AI. RMBG-2.0: Background removal model, 2024. URL https://huggingface.co/briaai/RMBG-2.0. Accessed: 2026-04-27.
- [32] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025.
- [33] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- [34] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [35] Xuebin Qin, Hang Dai, Xiaobin Hu, Deng-Ping Fan, Ling Shao, et al. Highly accurate dichotomous image segmentation. In European Conference on Computer Vision, 2022.
- [36] Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2777–2787, 2020.
- [37] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R Zaiane, and Martin Jagersand. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognition, 106:107404, 2020.
- [38] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3349–3364, 2020.
- [39] Chenxi Xie, Changqun Xia, Mingcan Ma, Zhirui Zhao, Xiaowu Chen, and Jia Li. Pyramid grafting network for one-stage high resolution saliency detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- [40] Yan Zhou, Bo Dong, Yuanfeng Wu, Wentao Zhu, Geng Chen, and Yanning Zhang. Dichotomous image segmentation with frequency priors. 2023.
- [41] Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Jorma Laaksonen, Wanli Ouyang, and Nicu Sebe. Bilateral reference for high-resolution dichotomous image segmentation. arXiv preprint arXiv:2401.03407, 2024.