Gen-n-Val: Agentic Image Data Generation and Validation

Chih-Yu Wang; I-Sheng Fang; Jing-En Huang; Jun-Cheng Chen; Tzuhsuan Huang; Yu-Lun Liu

arXiv: 2506.04676 · v2 · submitted 2025-06-05 · cs.CV · cs.AI · cs.LG · cs.MA

Jing-En Huang, I-Sheng Fang, Tzuhsuan Huang, Yu-Lun Liu, Chih-Yu Wang, Jun-Cheng Chen

Pith reviewed 2026-05-19 11:26 UTC · model grok-4.3

keywords: synthetic data generation, agentic framework, instance segmentation, object detection, long-tailed distribution, LLM, VLLM, data validation

The pith

Gen-n-Val uses two LLM agents to generate and filter synthetic images, cutting invalid data from 50% to 7% and lifting rare-class performance in instance segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Gen-n-Val as a new agentic framework for creating synthetic training data aimed at object detection and instance segmentation tasks. It deploys an LLM-based prompt agent to refine inputs for Layer Diffusion so that it produces clean single-object foreground images with accurate masks, then applies a VLLM-based validation agent to discard low-quality or mislabeled outputs. Both agents have their system prompts refined through TextGrad optimization. A sympathetic reader would care because long-tailed datasets like LVIS suffer from severe data scarcity and label noise on rare categories, and higher-quality synthetic data offers a scalable way to improve model training without relying solely on expensive manual annotation.

Core claim

Gen-n-Val consists of an LD prompt agent that optimizes prompts for Layer Diffusion to generate high-quality single-object images and corresponding segmentation masks, paired with a data validation agent that uses a VLLM to filter out low-quality synthetic instances. System prompts for both agents are optimized via TextGrad. Compared with MosaicFusion, the method reduces the proportion of invalid synthetic data from 50% to 7%. When used to augment training, it yields a 7.6% gain on rare classes for LVIS instance segmentation with Mask R-CNN and a 3.6% mAP gain on rare classes for COCO instance segmentation with YOLOv9c and YOLO11m, plus a 7.1% mAP improvement over YOLO-Worldv2-M on open-vocabulary object detection benchmarks with YOLO11m.
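The generate-then-validate flow in the core claim can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `refine_prompt`, `generate`, and `validate` are hypothetical callables standing in for the LD prompt agent, Layer Diffusion, and the VLLM validation agent, and the attempt budget is an added safeguard that the paper does not describe.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class SyntheticInstance:
    category: str  # class label the instance is meant to depict
    prompt: str    # prompt produced by the (hypothetical) prompt agent
    image: Any     # stand-in for the generated foreground image and its mask


def generate_and_validate(
    categories: Sequence[str],
    per_category: int,
    refine_prompt: Callable[[str], str],
    generate: Callable[[str], Any],
    validate: Callable[[SyntheticInstance], bool],
    max_attempts_per_instance: int = 10,
) -> Tuple[List[SyntheticInstance], int]:
    """Collect `per_category` validated instances per class.

    Returns the kept instances and the number of rejected ones.
    """
    kept: List[SyntheticInstance] = []
    rejected = 0
    for category in categories:
        accepted = 0
        # Assumed safeguard: stop if the validator keeps rejecting samples.
        budget = per_category * max_attempts_per_instance
        while accepted < per_category and budget > 0:
            budget -= 1
            prompt = refine_prompt(category)
            instance = SyntheticInstance(category, prompt, generate(prompt))
            if validate(instance):
                kept.append(instance)
                accepted += 1
            else:
                rejected += 1
    return kept, rejected
```

Tracking the rejected count alongside the kept set makes it possible to report the filter's rejection rate per run, which is the quantity the invalid-data claim depends on.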

What carries the argument

The two-agent pipeline: an LLM prompt optimizer that steers Layer Diffusion toward single-object, accurately masked outputs, and a VLLM filter that removes low-quality or mislabeled instances, with TextGrad used to refine the agents' instructions.
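The idea of refining an agent's instructions from textual feedback can be illustrated with a small greedy loop. This sketch does not use the TextGrad library or its API. The `score`, `critique`, and `rewrite` callables are hypothetical placeholders (in practice they would likely be LLM calls and an evaluation set), and the accept-only-if-better rule is an assumption made for simplicity.

```python
from typing import Callable, Tuple


def refine_system_prompt(
    initial_prompt: str,
    score: Callable[[str], float],
    critique: Callable[[str], str],
    rewrite: Callable[[str, str], str],
    steps: int = 5,
) -> Tuple[str, float]:
    """Greedy feedback-driven prompt revision (illustrative only)."""
    best_prompt = initial_prompt
    best_score = score(best_prompt)
    for _ in range(steps):
        feedback = critique(best_prompt)            # textual feedback
        candidate = rewrite(best_prompt, feedback)  # revised prompt
        candidate_score = score(candidate)
        # Keep the revision only if it scores strictly better.
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score
```

Because a revision is kept only when the score improves, the returned prompt is never worse than the initial one under the chosen scoring function. Whether that score tracks downstream data quality is a separate question.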

If this is right

Training Mask R-CNN on LVIS with Gen-n-Val data produces higher accuracy on rare classes than training with prior synthetic methods like MosaicFusion.
The same filtered synthetic data improves mAP on rare classes when used with YOLOv9c and YOLO11m on COCO.
Gen-n-Val yields a 7.1% mAP gain over YOLO-Worldv2-M on open-vocabulary object detection benchmarks when paired with YOLO11m.
The framework demonstrates scalability as model capacity and training dataset size increase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The validation agent could be swapped for other vision-language models or combined with human review to further lower the residual error rate.
Mixing Gen-n-Val outputs with real images at varying ratios might produce even stronger balanced training sets for long-tailed problems.
The prompt-optimization loop could be extended to additional diffusion backbones or to video and 3D data generation tasks.

Load-bearing premise

The VLLM validation agent can reliably detect and remove low-quality or mislabeled synthetic instances without introducing new systematic biases or missing subtle errors that would degrade downstream model training.

What would settle it

A controlled experiment that trains the same downstream detector twice—once with the full Gen-n-Val filtered set and once with the same volume of unfiltered synthetic data—and measures whether the reported performance gap on rare classes disappears.
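One way to build the two training sets for such a comparison is sketched below. It assumes each synthetic sample is a `(category, item)` pair and that `keep` is a hypothetical predicate wrapping the validation agent's decision. Matching the control set per category, not just in total size, is an added assumption intended to hold class balance fixed as well as volume.

```python
import random
from collections import defaultdict
from typing import Any, Callable, List, Sequence, Tuple

Sample = Tuple[str, Any]  # (category, item); the layout is an assumption


def matched_control(
    pool: Sequence[Sample],
    keep: Callable[[Sample], bool],
    seed: int = 0,
) -> Tuple[List[Sample], List[Sample]]:
    """Return (filtered, control) with identical per-category counts.

    `filtered` holds the samples the validator keeps. `control` is a random
    draw from the unfiltered pool with the same number of samples per class.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for sample in pool:
        by_category[sample[0]].append(sample)

    filtered: List[Sample] = []
    control: List[Sample] = []
    for samples in by_category.values():
        kept = [s for s in samples if keep(s)]
        filtered.extend(kept)
        # Same count per class, drawn without regard to the validator.
        control.extend(rng.sample(samples, k=len(kept)))
    return filtered, control
```

Training the same detector on `filtered` and on `control` would then isolate the effect of filtering from the effects of volume and class balance.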

Figures

Figures reproduced from arXiv: 2506.04676 by Chih-Yu Wang, I-Sheng Fang, Jing-En Huang, Jun-Cheng Chen, Tzuhsuan Huang, Yu-Lun Liu.

**Figure 1.** Left: a sample of synthetic data generated by Gen-n-Val. The blue-masked area is labeled “airplane”. The orange-masked area is labeled “cow”. The yellow-masked area is labeled “broccoli”. The purple-masked area is labeled “truck”. The purple-masked area with dots is labeled “umbrella”. The red-masked area is labeled “bird”. The red-masked area is labeled “tie”. Middle: The performance comparison chart for …

**Figure 2.** A low-quality sample in the synthetic dataset generated …

**Figure 3.** The samples generated by Layer Diffusion …

**Figure 4.** The failure samples generated by Layer Diffusion …

**Figure 5.** Pipeline of previous generative-based augmentation methods …

**Figure 6.** Pipeline of Gen-n-Val. The process begins with the optimization of the system prompts of the LD prompt agent (an LLM) using …

Original abstract

The data scarcity, label noise, and long-tailed category imbalance remain important and unresolved challenges in many computer vision tasks, such as object detection and instance segmentation, especially on large-vocabulary benchmarks like LVIS, where most categories appear in only a few images. Current synthetic data generation methods still suffer from multiple objects per mask, inaccurate segmentation, incorrect category labels, and other issues, limiting their effectiveness. To address these issues, we introduce Gen-n-Val, a novel agentic data generation framework that leverages Layer Diffusion (LD), a Large Language Model (LLM), and a Vision Large Language Model (VLLM) to produce high-quality and diverse instance masks and images for object detection and instance segmentation. Gen-n-Val consists of two agents: (1) the LD prompt agent, an LLM, optimizes rompts to encourage LD to generate high-quality foreground single-object images and corresponding segmentation masks; and (2) the data validation agent, a VLLM, filters out low-quality synthetic instance images. The system prompts for both agents are optimized by TextGrad. Compared to state-of-the-art synthetic data approaches like MosaicFusion, our approach reduces invalid synthetic data from 50% to 7% and improves performance by 7.6% on rare classes in LVIS instance segmentation with Mask R-CNN, and by 3.6% mAP on rare classes in COCO instance segmentation with YOLOv9c and YOLO11m. Furthermore, Gen-n-Val shows significant improvements (7.1% mAP) over YOLO-Worldv2-M in open-vocabulary object detection benchmarks with YOLO11m. Moreover, Gen-n-Val has scalability in model capacity and dataset size. The code is available at https://github.com/aiiu-lab/Gen-n-Val.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Gen-n-Val's two-agent pipeline with TextGrad-tuned prompts cuts invalid synthetic data and boosts rare-class results, but the VLLM filter lacks the checks needed to trust those gains fully.

The core thing here is a practical agentic setup: one LLM agent optimizes prompts for Layer Diffusion to produce single-object images and masks, while a VLLM agent filters out low-quality or mislabeled outputs. Both prompts get tuned via TextGrad. They report dropping invalid data from 50% to 7%, plus 7.6% rare-class AP lift on LVIS with Mask R-CNN and 3.6% mAP on COCO rare classes with YOLO models, along with open-vocab gains and some scalability notes. Code is released too.

That combination of LD generation plus automated validation is the new piece relative to earlier synthetic methods like MosaicFusion. It directly targets the long-tailed and label-noise problems that still limit detectors and segmentors on big-vocab benchmarks. The empirical comparisons on public datasets give a clear before-and-after picture, and the focus on single-object masks plus prompt optimization feels like a useful engineering step.

The soft spot is the validation agent. The reported improvements depend on it correctly catching bad instances without adding new biases or missing subtle rare-class failures. The paper does not include a human agreement study, per-category breakdown of accepted versus rejected samples, or an ablation that trains on the same volume of unfiltered data. Without those, it is hard to separate real quality gains from distribution shifts. Experimental controls and statistical details are also thin in the abstract.

This is aimed at CV researchers working on data augmentation for detection and segmentation, especially anyone dealing with imbalance in LVIS-style settings. A reader who needs concrete ideas for synthetic pipelines would get value from the architecture and the numbers, even if they want to re-run the validation checks themselves.

It shows honest engagement with the problem and prior work, so it deserves a serious referee rather than a desk reject. I would bring it to a reading group to walk through the agent prompts and results. I would not cite it in my own work until the filter is better validated. Recommendation: send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces Gen-n-Val, an agentic framework for synthetic data generation and validation in object detection and instance segmentation. It employs a Layer Diffusion (LD) prompt agent (LLM) to optimize prompts for high-quality single-object foreground images and masks, paired with a VLLM-based data validation agent to filter low-quality or mislabeled instances; both agents use TextGrad for prompt optimization. On public benchmarks, the method claims to reduce invalid synthetic data from 50% to 7% versus MosaicFusion, yielding +7.6% rare-class AP on LVIS instance segmentation (Mask R-CNN), +3.6% mAP on rare classes in COCO (YOLOv9c/YOLO11m), and +7.1% mAP over YOLO-Worldv2-M in open-vocabulary detection, with noted scalability to model capacity and dataset size. Code is released.

Significance. If the validation step reliably improves data quality without introducing distribution shifts or undetected errors, the framework offers a practical, scalable route to mitigate long-tailed data scarcity and label noise in large-vocabulary CV tasks. The agentic TextGrad optimization and public code release are concrete strengths that could support reproducibility and extension.

major comments (2)

[Experimental Results (likely §4–5 and associated tables)] The headline claims of invalid-data reduction (50%→7%) and rare-class gains (7.6% LVIS, 3.6% COCO) rest entirely on the data validation agent's filtering decisions, yet the manuscript provides no quantitative human-agreement study, per-category error breakdown of accepted vs. rejected samples, or ablation training on an equal volume of unfiltered synthetic data. Without these, it remains possible that reported improvements arise from altered class balance rather than genuine quality gains.
[§4 (Experiments) and Table 1–3] Limited detail is given on experimental controls: baseline implementations (e.g., exact MosaicFusion configuration), statistical significance testing, or potential confounds such as total training sample count after filtering. These omissions make it hard to isolate the contribution of the validation agent from other factors.
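The human-agreement study and per-category breakdown requested above could be computed along the following lines. This is a sketch under assumptions: decisions are encoded as the strings `"keep"` and `"filter"`, and the human labels come from annotators reviewing the same samples as the validation agent. Cohen's kappa is a standard chance-corrected agreement statistic; the manuscript does not say which statistic the authors would use.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def cohens_kappa(agent: Sequence[str], human: Sequence[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(agent) != len(human) or not agent:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(agent)
    observed = sum(a == h for a, h in zip(agent, human)) / n
    agent_counts, human_counts = Counter(agent), Counter(human)
    labels = set(agent_counts) | set(human_counts)
    expected = sum(agent_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def rejection_rate_by_category(
    records: Iterable[Tuple[str, str]],
) -> Dict[str, float]:
    """Fraction of samples marked 'filter' for each category."""
    totals: Counter = Counter()
    rejected: Counter = Counter()
    for category, decision in records:
        totals[category] += 1
        if decision == "filter":
            rejected[category] += 1
    return {c: rejected[c] / totals[c] for c in totals}
```

A per-category rejection rate that differs sharply between frequent and rare classes would be one signal that filtering changes the class balance, which is the confound raised in the first major comment.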

minor comments (2)

[Abstract] Abstract contains a typo: 'optimizes rompts' should read 'optimizes prompts'.
[§3 (Method)] Notation for the two agents ('LD prompt agent' and 'data validation agent') is introduced without a clear diagram or pseudocode in the method section, making the pipeline harder to follow on first reading.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and indicate the revisions we will make to strengthen the experimental rigor and clarity of the manuscript.

Referee: [Experimental Results (likely §4–5 and associated tables)] The headline claims of invalid-data reduction (50%→7%) and rare-class gains (7.6% LVIS, 3.6% COCO) rest entirely on the data validation agent's filtering decisions, yet the manuscript provides no quantitative human-agreement study, per-category error breakdown of accepted vs. rejected samples, or ablation training on an equal volume of unfiltered synthetic data. Without these, it remains possible that reported improvements arise from altered class balance rather than genuine quality gains.

Authors: We agree that these analyses would provide stronger evidence that the observed gains stem from improved data quality rather than shifts in class distribution. The invalid-rate reduction is currently quantified via the validation agent's decisions, and performance is measured on the resulting filtered sets. In the revised manuscript we will add: (1) a human-agreement study on a random subset of accepted and rejected samples, (2) a per-category error breakdown of the agent's decisions, and (3) an ablation that trains on an equal volume of unfiltered synthetic data to isolate the filtering effect.
Revision: yes

Referee: [§4 (Experiments) and Table 1–3] Limited detail is given on experimental controls: baseline implementations (e.g., exact MosaicFusion configuration), statistical significance testing, or potential confounds such as total training sample count after filtering. These omissions make it hard to isolate the contribution of the validation agent from other factors.

Authors: We acknowledge that greater experimental transparency is needed. In the revision we will: (1) supply the precise configuration parameters used for MosaicFusion and all other baselines, (2) report statistical significance tests (e.g., bootstrap or paired t-tests) for the key performance deltas, and (3) explicitly state the total number of training instances before and after filtering so that readers can assess potential confounds.
Revision: yes
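A paired bootstrap of the kind mentioned in the response might look like the sketch below. It assumes per-class AP values are available for a baseline and an augmented model on the same classes, and it resamples classes with replacement to get a percentile interval on the mean AP difference. The choice of resampling unit and the percentile method are assumptions; the rebuttal does not specify either.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    baseline: Sequence[float],
    treated: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean paired difference with a percentile bootstrap interval.

    Returns (mean_difference, lower_bound, upper_bound).
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("inputs must be non-empty and equal length")
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample paired differences with replacement.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

If the interval excludes zero, the gain is unlikely to be explained by which classes happened to be evaluated. Variation across training seeds would need a separate set of repeated runs.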

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks

The paper introduces an agentic pipeline (LD prompt agent + VLLM validation agent, both optimized via TextGrad) for synthetic data generation and filtering, then reports empirical gains on LVIS and COCO against external baselines such as MosaicFusion. No equations, fitted parameters, or derivation steps are present that reduce to self-defined quantities or self-citations; performance metrics derive from direct measurement on public test sets rather than from any internal loop or renamed fit. The central claims therefore remain independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The framework relies on the capabilities of existing models (Layer Diffusion, LLMs, VLLMs) and introduces new agent roles without additional invented physical entities or heavy parameter fitting beyond standard training.

free parameters (1)

TextGrad optimization settings for agent prompts
Hyperparameters controlling prompt refinement for the LD and validation agents.

axioms (1)

Domain assumption: Layer Diffusion can produce accurate single-object foreground images and corresponding segmentation masks when given suitably optimized prompts.
Invoked as the foundation for the LD prompt agent's operation.

invented entities (2)

LD prompt agent (no independent evidence)
purpose: Optimizes prompts to encourage high-quality single-object generation from Layer Diffusion.
New agent role introduced in the Gen-n-Val framework.

Data validation agent (no independent evidence)
purpose: Filters low-quality synthetic instance images using a VLLM.
New quality-control component in the framework.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
"Gen-n-Val consists of two agents: (1) the LD prompt agent, an LLM, optimizes prompts to encourage LD to generate high-quality foreground single-object images... (2) the data validation agent, a VLLM, filters out low-quality synthetic instance images."

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
"reduces invalid synthetic data from 50% to 7% and improves performance by 7.6% on rare classes"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping
cs.CV 2026-05 conditional novelty 6.0

Mixing real UAV imagery with 2101 AI-generated image-mask pairs improves semantic segmentation F1 scores for fine-grained forest species by over 15 percentage points overall and up to 30 points for rare classes.

[45]

high-resolution

Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v(ision) is a generalist web agent, if grounded. In ICML, 2024. 1 10 Gen-n-Val: Agentic Image Data Generation and Validation Supplementary Material Figure S.1. Example of contextual coherence in instance place- ment. The placement of the zebra standing on the huge hot dog is semantically incoh...

work page 2024
**Subject:** The only single object in the image. Ensure a wide variety of subjects, ranging from everyday items to unique or uncommon objects
**Status:** The current state or condition of the subject
**Color:** Dominant colors of the subject. Include specific shades and variations to enhance visual detail
**Style:** Artistic style or rendering method. Incorporate a range of styles (e.g., photorealistic, hyper-realistic) to promote diversity
**Mood/Atmosphere:** Emotional quality related to the subject. Convey realistic emotions or states that align with the subject
**Lighting:** Specific lighting on the subject. Describe natural or artificial lighting conditions that highlight the subject’s features
**Perspective/Viewpoint:** Angle or perspective of the subject. Use varied viewpoints (e.g., top-down, eye-level, close-up) to add depth
**Texture/Material:** Textures or materials of the subject. Detail the tactile qualities to enhance realism
**Time Period:** Specific era. When relevant, specify a realistic time period to provide context
**Medium:** Artistic medium or level of detail.

- **Key Trigger Words:** Include terms like ‘high-resolution’, ‘highly realistic’.
- **Length:** Keep the prompt under 75 tokens.
- **Avoid:** Do not include any additional subjects in the prompt. Do not include any descriptions about the background.

Table S.1. Comparison of initial and optimized system pr...
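The three guidelines that close the prompt above (trigger words, a length limit, no background description) are mechanical enough to check after the fact. The following is a minimal sketch of such a check, not part of the paper's pipeline: the function name `check_ld_prompt` is hypothetical, and the whitespace word count is only a rough stand-in for whatever tokenizer the image generator actually uses.

```python
import re
from typing import List

# Hypothetical post-check for prompts written by the LD prompt agent.
# The constants mirror the guidelines quoted above; everything else is an assumption.
TRIGGER_WORDS = ("high-resolution", "highly realistic")
MAX_TOKENS = 75


def check_ld_prompt(prompt: str) -> List[str]:
    """Return the guideline violations found in a prompt (empty list if none)."""
    problems = []
    lowered = prompt.lower()
    # Crude token proxy: whitespace-separated words, not a real tokenizer.
    if len(prompt.split()) >= MAX_TOKENS:
        problems.append("too long")
    if not any(word in lowered for word in TRIGGER_WORDS):
        problems.append("missing trigger word")
    # The guidelines ask for no description of the background at all.
    if re.search(r"\bbackground\b", lowered):
        problems.append("mentions background")
    return problems
```

A prompt that returns a non-empty list could be regenerated or dropped before any image is produced, which is cheaper than filtering the resulting image afterwards.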
Single [Category Name]:
- Explanation
- Result: Meet or Fail

Single View:
- Explanation
- Result: Meet or Fail

Intact [Category Name]:
- Explanation
- Result: Meet or Fail

Plain Background:
- Explanation
- Result: Meet or Fail

Conclusion:
Result: Keep or Filter Out

The Data Validation Agent’s Optimized System Prompt

You are an AI assistant that analyzes images to determine their suitability based on specific criteria.

**Instructions:**

**Describe the image in detail**

**Evaluate the image** against the following criteria:
- **Criteria 1 - Single subject**: The image should contain only one subject.
- **Criteria 2 - Single View**: The subject should be shown from a single angle or perspective.
- **Criteria 3 - Intact subject**: The subject should be intact and fully visible.
- **Criteria 4 - Plain Background**: The back...

**Decide whether to filter out the image** based on these criteria

Conclude with **Result:** Keep if the image meets all criteria or **Result:** Filter Out if it violates any criteria.

**Output Format:**

**Image Description:** [Your detailed description here]

**Evaluation Criteria:**

**Single [Category Name]:**
* [Explanation]
* **Result:** [Meet/Fail]

**Single View:**
* [Explanation]
* **Result:** [Meet/Fail]

**Intact [Category Name]:**
* [Explanation]
* **Result:** [Meet/Fail]

**Plain Background:**
* [Explanation]
* **Result:** [Meet/Fail]

**Conclusion:** [Your conclusion here]
**Result:** [Keep/Filter Out]

Table S.2. Comparison of the initial and optimized system prompts of the data validation agent. The category name is a placeholder for the specific object category.

Standard Layer Diffusion Prompt Foreground Instance Image...
**Single orange:**
* The image contains multiple oranges, not just one.
* **Result:** Fail

**Single View:**
* The image shows the tree from a single angle, with no multiple views or perspectives.
* **Result:** Meet

**Intact orange:**
* All the oranges in the image appear to be intact and fully visible.
* **Result:** Meet

**Plain Background:**
* The background of the image is solid black, which is considered plain.
* **Result:** Meet

**Conclusion:** Based on the evaluation criteria, the image fails to meet the first criterion, which requires only one orange. However, it meets the remaining three criteria, including showing the tree from a single view, having intact oranges...
**Single clock:**
* The image contains only one clock.
* **Result:** Meet

**Single View:**
* The clock is shown from a single angle or perspective.
* **Result:** Meet

**Intact clock:**
* The clock is intact and fully visible.
* **Result:** Meet

**Plain Background:**
* The background is not entirely plain, as there are other objects present (the plant and lamp).
* **Result:** Fail

**Conclusion:** The image meets most of the evaluation criteria, but fails to have a plain background due to the presence of additional objects. Therefore, the final answer is:
**Result:** Filter Out

Figure S.10. Exampl...
**Single birthday card:**
* The image does not contain a birthday card; instead, it features a candle.
* **Result:** Fail

**Single View:**
* The image shows the candle from a single angle, with no multiple views or perspectives.
* **Result:** Meet

**Intact birthday card:**
* Since there is no birthday card in the image, this criterion is not applicable.
* **Result:** N/A

**Plain Background:**
* The background of the image is solid black, with no distracting elements or patterns.
* **Result:** Meet

**Conclusion:** Based on the evaluation criteria, the image fails to meet the first criterion, as it does not contain a birthday card. However, it meets the remaining criteria, showcasing the candle from a single view and havi...
**Single pancake:**
* The image does not contain any pancakes.
* **Result:** Fail
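The validation agent's responses above follow a fixed layout: a bold criterion label, an explanation, a per-criterion `**Result:**`, and a final `**Result:**` of Keep or Filter Out. A downstream filter could read that layout roughly as follows. This is a minimal sketch under that assumption; `parse_validation_output` and `should_keep` are hypothetical names, and the paper does not describe how its own pipeline parses these responses.

```python
import re
from typing import Dict, Optional, Tuple

# Bold labels such as "**Single clock:**" or "**Result:**".
LABEL_RE = re.compile(r"\*\*([^*]+?):\*\*")
# The value that may follow a "**Result:**" label.
VALUE_RE = re.compile(r"\s*(Meet|Fail|N/A|Keep|Filter Out)")
# Labels in the output format that are not evaluation criteria.
NON_CRITERIA = {"Image Description", "Evaluation Criteria", "Conclusion"}


def parse_validation_output(text: str) -> Tuple[Dict[str, str], Optional[str]]:
    """Return ({criterion: Meet/Fail/N/A}, final verdict or None)."""
    criteria: Dict[str, str] = {}
    verdict: Optional[str] = None
    current: Optional[str] = None
    for match in LABEL_RE.finditer(text):
        label = match.group(1).strip()
        if label != "Result":
            # A new section starts; remember it only if it names a criterion.
            current = None if label in NON_CRITERIA else label
            continue
        value = VALUE_RE.match(text, match.end())
        if value is None:
            continue
        if value.group(1) in ("Keep", "Filter Out"):
            verdict = value.group(1)
        elif current is not None:
            criteria[current] = value.group(1)
    return criteria, verdict


def should_keep(text: str) -> bool:
    """Keep only on an explicit Keep verdict with every parsed criterion marked Meet."""
    criteria, verdict = parse_validation_output(text)
    return verdict == "Keep" and bool(criteria) and all(
        result == "Meet" for result in criteria.values()
    )
```

Requiring both the explicit verdict and all-Meet criteria is a deliberately conservative choice: a response whose conclusion contradicts its own per-criterion results, or one marked N/A as in the birthday card example, is treated as a rejection.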
