Shape2Animal: Creative Animal Generation from Natural Silhouettes

Anh-Tuan Vo; Dinh-Khoi Vo; Minh-Triet Tran; Quoc-Duy Tran; Tam V. Nguyen; Trung-Nghia Le

arxiv: 2506.20616 · v4 · submitted 2025-06-25 · 💻 cs.CV

Shape2Animal: Creative Animal Generation from Natural Silhouettes

Quoc-Duy Tran , Anh-Tuan Vo , Dinh-Khoi Vo , Tam V. Nguyen , Minh-Triet Tran , Trung-Nghia Le This is my paper

Pith reviewed 2026-05-19 07:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords shape to animalpareidoliasilhouette interpretationdiffusion modelsvision-language modelsimage synthesisscene blendingcreative generation

0 comments

The pith

Shape2Animal automates the reinterpretation of natural silhouettes like clouds or stones as fitting animal images through segmentation and generative synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that extracts object silhouettes from everyday scenes and converts them into plausible animals. It begins with open-vocabulary segmentation to isolate shapes, then uses vision-language models to select matching animal ideas. A text-to-image diffusion model creates an animal that follows the extracted outline, after which the result is blended back into the original image. If successful, this process yields coherent compositions that mimic how humans see animals in ambiguous forms, opening paths for creative image tasks.

Core claim

The Shape2Animal framework first performs open-vocabulary segmentation to extract object silhouettes and interprets semantically appropriate animal concepts using vision-language models. It then synthesizes an animal image that conforms to the input shape, leveraging text-to-image diffusion models and seamlessly blends it into the original scene to generate visually coherent and spatially consistent compositions.

What carries the argument

The end-to-end pipeline that combines open-vocabulary segmentation for shape extraction, vision-language model concept selection, diffusion-based animal synthesis, and scene blending.

If this is right

The framework produces visually coherent and spatially consistent animal compositions from diverse real-world inputs.
It demonstrates robustness when applied to varied natural silhouettes such as clouds, stones, or flames.
It supports new uses in visual storytelling, educational content, digital art, and interactive media design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar pipelines could be adapted to reinterpret silhouettes as non-animal objects such as vehicles or plants.
The approach suggests a route for testing how generative models handle shape constraints across different input domains.
Integration with user-provided prompts might allow controlled variation in the chosen animal while preserving the silhouette match.

Load-bearing premise

Vision-language models can reliably pick animal concepts that match extracted silhouettes and diffusion models can generate animals whose shapes align closely enough for seamless blending.

What would settle it

Running the system on a clear silhouette such as a cloud outline and observing that the generated animal body deviates substantially from the input contour or shows obvious blending artifacts in the final image.

Figures

Figures reproduced from arXiv: 2506.20616 by Anh-Tuan Vo, Dinh-Khoi Vo, Minh-Triet Tran, Quoc-Duy Tran, Tam V. Nguyen, Trung-Nghia Le.

**Figure 1.** Figure 1: Shape2Animal emulates the human ability of pareidolia, reimagining natural silhouettes as animals by leveraging [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Overview of the Shape2Animal framework. Given a natural object image, the system extracts a salient silhouette, generates [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Pipeline of silhouette segmentation module for extracting [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Concept interpreted from silhouette using the Gemini [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative results of Shape2Animal. plausible. In Task 3 – Visual Evaluation, participants rated the aesthetic quality of the final generated animal images. The evaluation focused on two key dimensions: Conceptual Plausibility and Visual Aesthetic. Conceptual Plausibility was measured through the agreement ratios between participant guesses and AI-generated labels in the first two tasks. Visual Aestheti… view at source ↗

**Figure 6.** Figure 6: Shape2Animal applied on leaves. 4.4.2 Limitations Shape2Animal inherits limitations from the foundation models it builds upon. Segmentation failures can occur when SAM [11] struggles with ambiguous or low-contrast objects, leading to incomplete or inaccurate masks . During concept grounding, Gemini [7] may generate implausible animal suggestions when faced with highly abstract shapes. In the synthesis st… view at source ↗

**Figure 7.** Figure 7: Failure cases of Shape2Animal. ACKNOWLEDGMENTS This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant Number 102.05-2023.31. REFERENCES [1] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, M. Azar, A. Radford, J. Han, J. Huang, et al. Flamingo: a visual language model for few-shot learning, 2022. arXiv preprint arXiv:2204.14198… view at source ↗

read the original abstract

Humans possess a unique ability to perceive meaningful patterns in ambiguous stimuli, a cognitive phenomenon known as pareidolia. This paper introduces Shape2Animal framework to mimics this imaginative capacity by reinterpreting natural object silhouettes, such as clouds, stones, or flames, as plausible animal forms. Our automated framework first performs open-vocabulary segmentation to extract object silhouette and interprets semantically appropriate animal concepts using vision-language models. It then synthesizes an animal image that conforms to the input shape, leveraging text-to-image diffusion model and seamlessly blends it into the original scene to generate visually coherent and spatially consistent compositions. We evaluated Shape2Animal on a diverse set of real-world inputs, demonstrating its robustness and creative potential. Our Shape2Animal can offer new opportunities for visual storytelling, educational content, digital art, and interactive media design. Our project page is here: https://shape2image.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Shape2Animal is a straightforward pipeline chaining segmentation, VLMs, and diffusion for turning silhouettes into animals, but it needs real metrics to show the steps actually work.

read the letter

The main thing to know is that this paper describes an end-to-end system for creative generation: it segments a natural silhouette with open-vocabulary tools, picks an animal concept via VLM, synthesizes a shape-matching animal with diffusion, and blends it into the scene. The goal is to mimic pareidolia on inputs like clouds, stones, or flames for art and education uses. The specific pipeline for this task looks new relative to the cited work, even if the pieces are standard components. What it does well is lay out a coherent, practical flow that directly applies existing models without overcomplicating the architecture, and the project page implies they have visual results to illustrate the outputs. The application has clear downstream value for digital storytelling or interactive media. The soft spots are in the evidence. The abstract and description give no numbers, no ablations, no baseline comparisons, and no breakdown of failure cases, so claims of robustness rest on unshown tests. The stress-test concern about segmentation failing on amorphous or low-contrast shapes like flames or clouds is worth taking seriously; if the first step produces fragmented masks, everything after it is compromised. Without seeing detailed experiments in the full text, it's unclear how often the shape conformity and seamless blending actually succeed. This work is for people building generative tools in computer vision or creative AI rather than those chasing theoretical advances. A reader looking for ideas on combining these models for shape-guided tasks could pick up useful implementation pointers. It shows honest engagement with the components and literature, so it qualifies as serious thinking on its own terms. I would bring it to a reading group to discuss the practical pipeline. Send it to peer review so referees can push for the missing quantitative checks and edge-case testing.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Shape2Animal, a framework that mimics pareidolia by reinterpreting natural silhouettes (clouds, stones, flames) as animals. The pipeline extracts silhouettes via open-vocabulary segmentation, selects animal concepts with vision-language models, synthesizes shape-conforming animals using text-to-image diffusion, and blends the result back into the original scene for coherent compositions. It claims robustness and creative potential on diverse real-world inputs for applications in storytelling and digital art.

Significance. If the pipeline functions as described, it could enable accessible creative tools for visual content generation. However, the absence of quantitative metrics, ablation studies, or failure analyses in the manuscript makes it difficult to determine whether the approach advances beyond existing segmentation-plus-diffusion pipelines or delivers reliable shape conformity and blending.

major comments (2)

[Abstract] Abstract: the claim that the framework 'demonstrates its robustness' rests on an unevaluated pipeline; no metrics, datasets, baselines, or ablation results are reported to support robustness or creative potential.
[Method] Method (open-vocabulary segmentation stage): the framework assumes reliable silhouette extraction for amorphous inputs, yet standard open-vocabulary segmenters are known to produce fragmented or empty masks on low-contrast, non-rigid phenomena such as clouds or flames; this precondition is load-bearing for all downstream VLM selection and diffusion synthesis steps.

minor comments (1)

[Abstract] Abstract: the project page URL is given but no code, model checkpoints, or implementation details are referenced in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review of our manuscript. We appreciate the referee's identification of key areas where the presentation of evaluation and methodological assumptions can be strengthened. We respond to each major comment below and outline the revisions we will incorporate.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the framework 'demonstrates its robustness' rests on an unevaluated pipeline; no metrics, datasets, baselines, or ablation results are reported to support robustness or creative potential.

Authors: We agree that the current manuscript supports its claims primarily through qualitative visual results on a collection of real-world silhouettes rather than quantitative metrics or controlled ablations. The evaluation section describes a diverse set of inputs (clouds, stones, flames, etc.) and shows the resulting compositions, but does not report numerical scores, baselines, or failure-rate statistics. We will revise the abstract to replace the phrase 'demonstrating its robustness' with a more precise statement such as 'illustrating its applicability to diverse real-world inputs.' We will also expand the experiments section to explicitly describe the input curation process, the qualitative assessment criteria used, and a short failure-case analysis. revision: yes
Referee: [Method] Method (open-vocabulary segmentation stage): the framework assumes reliable silhouette extraction for amorphous inputs, yet standard open-vocabulary segmenters are known to produce fragmented or empty masks on low-contrast, non-rigid phenomena such as clouds or flames; this precondition is load-bearing for all downstream VLM selection and diffusion synthesis steps.

Authors: This is a fair and important observation. The pipeline does depend on obtaining usable silhouettes, and open-vocabulary models can indeed return incomplete or noisy masks for low-contrast, non-rigid shapes. In our current implementation we rely on prompt engineering and simple morphological cleanup to improve mask quality, and we select demonstration examples where the extracted silhouette is sufficiently coherent. We will add a dedicated paragraph in the method section that acknowledges this assumption, describes the specific segmentation model and prompting strategy employed, and includes representative failure examples together with the downstream effects on animal synthesis. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive engineering pipeline with no derivations or fitted predictions

full rationale

The paper describes an applied framework that chains existing tools (open-vocabulary segmentation, VLMs, diffusion models, blending) to reinterpret silhouettes as animals. No equations, parameter fitting, or first-principles derivations appear in the provided text or abstract. All steps are procedural engineering choices rather than claims that reduce to their own inputs by construction. No self-citation load-bearing arguments or uniqueness theorems are invoked. The contribution is therefore self-contained against external benchmarks and receives the default non-finding of zero circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied engineering paper in computer vision. No free parameters are fitted, no new axioms are stated, and no new physical or mathematical entities are postulated beyond reuse of existing models.

pith-pipeline@v0.9.0 · 5700 in / 1150 out tokens · 50757 ms · 2026-05-19T07:22:17.487304+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our automated framework first performs open-vocabulary segmentation to extract object silhouette and interprets semantically appropriate animal concepts using vision-language models. It then synthesizes an animal image that conforms to the input shape, leveraging text-to-image diffusion model

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 4 internal anchors

[1]

Flamingo: a Visual Language Model for Few-Shot Learning

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, M. Azar, A. Radford, J. Han, J. Huang, et al. Flamingo: a vi- sual language model for few-shot learning, 2022. arXiv preprint arXiv:2204.14198. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Bednarik

R. Bednarik. Collective pareidolia. Qeios, 2024. 1

work page 2024
[3]

Midas v3

R. Birkl, D. Wofk, and M. M ¨uller. Midas v3.1 – a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460, 2023. 2

work page arXiv 2023
[4]

W. Doniger. Animals and the Human Imagination: A Companion to Animal Studies. Columbia University Press, 2012. 1

work page 2012
[5]

W. Doniger. Animals and the Human Imagination: A Companion to Animal Studies. Columbia University Press, 2012. Accessed April

work page 2012
[6]

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial net- works, 2014. 2

work page 2014
[7]

Gemini API: gemini-2.5-flash-preview-04-17

Google. Gemini API: gemini-2.5-flash-preview-04-17. https://ai.google.dev/gemini-api/docs/models# gemini-2.5-flash-preview , April 2025. Accessed April

work page 2025
[8]

ControlNet for Stable Diffusion XL: controlnet- depth-sdxl-1.0-small

Hugging Face. ControlNet for Stable Diffusion XL: controlnet- depth-sdxl-1.0-small. https://huggingface.co/diffusers/ controlnet-depth-sdxl-1.0-small , 2023. Accessed April

work page 2023
[9]

Isola, J.-Y

P. Isola, J.-Y . Zhu, T. Zhou, and A. A. Efros. Image-to-image transla- tion with conditional adversarial networks, 2018. 1

work page 2018
[10]

D. P. Kingma and M. Welling. Auto-encoding variational bayes, 2022. 2

work page 2022
[11]

Segment Anything

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll ´ar, and R. Gir- shick. Segment anything. arXiv preprint arXiv:2304.02643 , 2023. Accessed April 2025. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

J. Liu, J. Li, L. Feng, L. Li, J. Tian, and K. Lee. Seeing jesus in toast: Neural and behavioral correlates of face pareidolia.Cortex, 53:60–77,

work page
[13]

Martinez-Conde

S. Martinez-Conde. The brain sees faces everywhere. Scientific Amer- ican, Sep 2015. 1

work page 2015
[14]

Martinez-Conde

S. Martinez-Conde. The brain sees faces everywhere. Scientific Amer- ican, Sep. 2015. 2

work page 2015
[15]

Mou et al

L. Mou et al. T2i-adapter: Learning adapters to dig out more control- lable ability for text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12345–12354, 2023. 2

work page 2023
[16]

Podell, Z

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M¨uller, J. Penna, and R. Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 1

work page 2023
[17]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervi- sion. In Proceedings of the 38th International Conference on Machine Learning, vol. 139, pp. 8748–8763. PMLR, 2021. 2

work page 2021
[18]

V . S. Ramachandran and W. Hirstein. The science of art: A neurolog- ical theory of aesthetic experience. Journal of Consciousness Studies, 6(6–7):15–51, 1999. 1

work page 1999
[19]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchi- cal text-conditional image generation with clip latents, 2022. arXiv preprint arXiv:2204.06125. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[20]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High- resolution image synthesis with latent diffusion models. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022. 2, 4

work page 2022
[21]

Saharia, W

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi. Photorealistic text-to-image diffu- sion models with deep language understanding, 2022. 1, 2

work page 2022
[22]

A. Somaini. On the altered states of machine vision. AN-ICON. Stud- ies in Environmental Images [ISSN 2785-7433], 2022. 1

work page 2022
[23]

Y . Yang, T. Zhang, G. Li, T. Kim, and G. Wang. An unsupervised domain adaptation model based on dual-module adversarial training. Neurocomputing, 475:102–111, Feb. 2022. doi: 10.1016/j.neucom. 2021.12.060 1

work page doi:10.1016/j.neucom 2022
[24]

Z. Yang, J. Wang, Z. Gan, L. Li, K. Lin, C. Wu, N. Duan, Z. Liu, C. Liu, M. Zeng, and L. Wang. Reco: Region-controlled text-to-image generation, 2022. 1, 2

work page 2022
[25]

Zhang, A

L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 1, 2

work page 2023
[26]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

S. Zhang, X. Chen, X. Wang, Z. Liu, S. Liu, M. Li, and P. Luo. Grounding dino: Marrying dino with grounded pre-training for open- set object detection. arXiv preprint arXiv:2303.05499 , 2023. Ac- cessed April 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

J.-Y . Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks, 2020. 1

work page 2020

[1] [1]

Flamingo: a Visual Language Model for Few-Shot Learning

J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, M. Azar, A. Radford, J. Han, J. Huang, et al. Flamingo: a vi- sual language model for few-shot learning, 2022. arXiv preprint arXiv:2204.14198. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[2] [2]

Bednarik

R. Bednarik. Collective pareidolia. Qeios, 2024. 1

work page 2024

[3] [3]

Midas v3

R. Birkl, D. Wofk, and M. M ¨uller. Midas v3.1 – a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460, 2023. 2

work page arXiv 2023

[4] [4]

W. Doniger. Animals and the Human Imagination: A Companion to Animal Studies. Columbia University Press, 2012. 1

work page 2012

[5] [5]

W. Doniger. Animals and the Human Imagination: A Companion to Animal Studies. Columbia University Press, 2012. Accessed April

work page 2012

[6] [6]

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial net- works, 2014. 2

work page 2014

[7] [7]

Gemini API: gemini-2.5-flash-preview-04-17

Google. Gemini API: gemini-2.5-flash-preview-04-17. https://ai.google.dev/gemini-api/docs/models# gemini-2.5-flash-preview , April 2025. Accessed April

work page 2025

[8] [8]

ControlNet for Stable Diffusion XL: controlnet- depth-sdxl-1.0-small

Hugging Face. ControlNet for Stable Diffusion XL: controlnet- depth-sdxl-1.0-small. https://huggingface.co/diffusers/ controlnet-depth-sdxl-1.0-small , 2023. Accessed April

work page 2023

[9] [9]

Isola, J.-Y

P. Isola, J.-Y . Zhu, T. Zhou, and A. A. Efros. Image-to-image transla- tion with conditional adversarial networks, 2018. 1

work page 2018

[10] [10]

D. P. Kingma and M. Welling. Auto-encoding variational bayes, 2022. 2

work page 2022

[11] [11]

Segment Anything

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll ´ar, and R. Gir- shick. Segment anything. arXiv preprint arXiv:2304.02643 , 2023. Accessed April 2025. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

J. Liu, J. Li, L. Feng, L. Li, J. Tian, and K. Lee. Seeing jesus in toast: Neural and behavioral correlates of face pareidolia.Cortex, 53:60–77,

work page

[13] [13]

Martinez-Conde

S. Martinez-Conde. The brain sees faces everywhere. Scientific Amer- ican, Sep 2015. 1

work page 2015

[14] [14]

Martinez-Conde

S. Martinez-Conde. The brain sees faces everywhere. Scientific Amer- ican, Sep. 2015. 2

work page 2015

[15] [15]

Mou et al

L. Mou et al. T2i-adapter: Learning adapters to dig out more control- lable ability for text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12345–12354, 2023. 2

work page 2023

[16] [16]

Podell, Z

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M¨uller, J. Penna, and R. Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 1

work page 2023

[17] [17]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervi- sion. In Proceedings of the 38th International Conference on Machine Learning, vol. 139, pp. 8748–8763. PMLR, 2021. 2

work page 2021

[18] [18]

V . S. Ramachandran and W. Hirstein. The science of art: A neurolog- ical theory of aesthetic experience. Journal of Consciousness Studies, 6(6–7):15–51, 1999. 1

work page 1999

[19] [19]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchi- cal text-conditional image generation with clip latents, 2022. arXiv preprint arXiv:2204.06125. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[20] [20]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High- resolution image synthesis with latent diffusion models. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022. 2, 4

work page 2022

[21] [21]

Saharia, W

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi. Photorealistic text-to-image diffu- sion models with deep language understanding, 2022. 1, 2

work page 2022

[22] [22]

A. Somaini. On the altered states of machine vision. AN-ICON. Stud- ies in Environmental Images [ISSN 2785-7433], 2022. 1

work page 2022

[23] [23]

Y . Yang, T. Zhang, G. Li, T. Kim, and G. Wang. An unsupervised domain adaptation model based on dual-module adversarial training. Neurocomputing, 475:102–111, Feb. 2022. doi: 10.1016/j.neucom. 2021.12.060 1

work page doi:10.1016/j.neucom 2022

[24] [24]

Z. Yang, J. Wang, Z. Gan, L. Li, K. Lin, C. Wu, N. Duan, Z. Liu, C. Liu, M. Zeng, and L. Wang. Reco: Region-controlled text-to-image generation, 2022. 1, 2

work page 2022

[25] [25]

Zhang, A

L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 1, 2

work page 2023

[26] [26]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

S. Zhang, X. Chen, X. Wang, Z. Liu, S. Liu, M. Li, and P. Luo. Grounding dino: Marrying dino with grounded pre-training for open- set object detection. arXiv preprint arXiv:2303.05499 , 2023. Ac- cessed April 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

J.-Y . Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks, 2020. 1

work page 2020