Semantic Browsing: Controllable Diversity for Image Generation

Daniel Cohen-Or; Maya Vishnevsky; Omer Dahary; Or Patashnik; Sara Dorfman

arxiv: 2606.23679 · v1 · pith:IS5GUHZVnew · submitted 2026-06-22 · 💻 cs.CV · cs.AI · cs.GR · cs.LG

Sara Dorfman, Maya Vishnevsky, Omer Dahary, Or Patashnik, Daniel Cohen-Or

Pith reviewed 2026-06-26 09:01 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR · cs.LG

keywords: semantic diversity, text-to-image generation, controlled variation, vision language model, agentic workflow, image browsing

The pith

A method induces controllable diversity in text-to-image outputs by generating structured semantic variations directly in text prompts via a vision-language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern text-to-image models achieve high fidelity but collapse outputs to one visual reading of a prompt. The paper addresses this by shifting diversity generation from stochastic image sampling to deliberate text-level changes. It employs a vision-language model inside an agentic workflow that reads the full scene context and produces prompt variations aligned with the original intent. The result is a set of images that differ along explicit, human-interpretable semantic axes rather than incidental noise. Users can therefore browse galleries by traversing those axes as design decisions.

Core claim

By exploiting the separation between semantic planning and pixel synthesis in models trained on elaborated captions, an agentic VLM workflow can generate prompt variants that enforce structured, meaningful diversity; every resulting image then corresponds to one understandable semantic choice.

What carries the argument

An agentic workflow that uses a Vision Language Model on full scene context to enforce structured textual variations attuned to the original prompt.
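
The Figure 5 caption names the agents in this workflow: Context Analyst, Brainstormer, Decision Maker, Critic, and JSON Refiner. A minimal sketch of how one iteration of such a loop might be wired is given below. Every function body is a hypothetical stub standing in for a VLM call, the flat dictionary is an assumed stand-in for the paper's structured JSON scene, and the `aspect=value` constraint strings are an illustrative convention, not the authors' format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Scene = Dict[str, str]  # assumed flat stand-in for the structured JSON scene


@dataclass
class Edit:
    aspect: str       # the single semantic aspect chosen for modification
    instruction: str  # the new realization to commit for that aspect


def context_analyst(scene: Scene, constraints: List[str]) -> List[str]:
    """Stub: list aspects that no accumulated constraint has fixed yet."""
    locked = {c.split("=")[0] for c in constraints}
    return [k for k in scene if k not in locked]


def brainstormer(aspects: List[str]) -> Dict[str, List[str]]:
    """Stub: propose candidate realizations per aspect (a VLM call in practice)."""
    ideas = {"style": ["watercolor", "linocut"], "subject": ["foxes", "owls"]}
    return {a: ideas.get(a, []) for a in aspects}


def decision_maker(candidates: Dict[str, List[str]]) -> Edit:
    """Stub: pick the first aspect that has proposals, and its first proposal."""
    for aspect, options in candidates.items():
        if options:
            return Edit(aspect, options[0])
    raise ValueError("no free aspect left to vary")


def critic(edit: Edit, constraints: List[str]) -> Edit:
    """Stub: reject edits that would touch an already committed aspect."""
    if any(c.startswith(edit.aspect + "=") for c in constraints):
        raise ValueError(f"aspect {edit.aspect!r} is already committed")
    return edit


def json_refiner(scene: Scene, edit: Edit) -> Scene:
    """Stub: apply the approved edit to a copy of the scene."""
    refined = dict(scene)
    refined[edit.aspect] = edit.instruction
    return refined


def step(scene: Scene, constraints: List[str]) -> Tuple[Scene, List[str]]:
    """One iteration: analyze, brainstorm, decide, critique, refine."""
    aspects = context_analyst(scene, constraints)
    edit = critic(decision_maker(brainstormer(aspects)), constraints)
    new_constraints = constraints + [f"{edit.aspect}={edit.instruction}"]
    return json_refiner(scene, edit), new_constraints


# Toy run for the Figure 1 prompt "A poster featuring animals".
scene = {"subject": "animals", "style": "poster"}
new_scene, new_constraints = step(scene, ["subject=animals"])
```

In this toy run the subject is locked by the user prompt, so the only free aspect is the style, and the constraint history grows by one entry per step. The point of the sketch is the data flow (current scene plus constraint history in, revised scene plus extended history out), not the stubbed decisions.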

If this is right

Image galleries become systematically navigable along explicit semantic dimensions.
Creative exploration proceeds through user-understandable design choices instead of random sampling.
Diversity control moves upstream to the text representation and no longer depends on the image model's internal stochasticity.
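
One way to picture a gallery that is navigable along explicit semantic dimensions is a tree whose nodes carry the constraints committed so far. The sketch below follows the two moves described in the Figure 6 caption (commit to a new realization of an aspect, or preserve the current one and explore other aspects). The `Node` class, its field names, and the reading of "preserve" as fixing the current value are assumptions made for illustration, not the paper's data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    scene: Dict[str, str]  # assumed flat scene description
    committed: Dict[str, str] = field(default_factory=dict)  # aspects fixed so far
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list, repr=False, compare=False)

    def commit(self, aspect: str, value: str) -> "Node":
        """Commit a realization of `aspect` and return the resulting child state."""
        if aspect in self.committed:
            raise ValueError(f"{aspect!r} is already fixed on this path")
        child = Node(
            scene={**self.scene, aspect: value},
            committed={**self.committed, aspect: value},
            parent=self,
        )
        self.children.append(child)
        return child

    def preserve(self, aspect: str) -> "Node":
        """Keep the current realization of `aspect`, leaving other aspects open."""
        return self.commit(aspect, self.scene[aspect])


root = Node(scene={"subject": "animals", "style": "flat vector", "mood": "playful"})
watercolor = root.commit("style", "watercolor")  # sibling branch 1
linocut = root.commit("style", "linocut")        # sibling branch 2
kept_mood = watercolor.preserve("mood")          # mood stays, style stays fixed
```

Sibling nodes here differ from one another in exactly one aspect, and every descendant inherits the constraints of its ancestors, which is the property the Figure 4 and Figure 7 captions describe.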

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same text-level control could be tested on video or 3D generators that also rely on rich captions.
Interactive interfaces could expose the semantic axes directly to designers for iterative selection.
The approach might reduce dependence on multiple random seeds in production pipelines.

Load-bearing premise

Text-to-image models trained on elaborated captions have already decoupled semantic decision-making from pixel generation.

What would settle it

A user study or automated check in which the generated image sets show no consistent, human-recognizable semantic differences along the claimed axes, or in which the variations revert to incidental pixel noise.
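
An automated check of this kind could follow the analysis in the Figure 11 caption, which relates pairwise DINO distance to the number of edge hops between nodes of the generation tree. The sketch below groups pairwise feature distances by hop count. It assumes cosine distance on precomputed feature vectors and uses a hand-written parent map; feature extraction is out of scope, and all names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Optional, Sequence


def path_to_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the node followed by its ancestors up to the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def hops(a: str, b: str, parent: Dict[str, Optional[str]]) -> int:
    """Edge hops between two tree nodes via their lowest common ancestor."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for j, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:
            return steps_from_a[n] + j
    raise ValueError("nodes are not in the same tree")


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def mean_distance_by_hops(
    parent: Dict[str, Optional[str]], feats: Dict[str, Sequence[float]]
) -> Dict[int, float]:
    """Average pairwise feature distance for each hop count, in ascending order."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for a, b in combinations(sorted(feats), 2):
        buckets[hops(a, b, parent)].append(cosine_distance(feats[a], feats[b]))
    return {h: sum(d) / len(d) for h, d in sorted(buckets.items())}


# Toy tree: root -> {a, b}, a -> {a1, a2}. Feature vectors are made up.
parent = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a"}
feats = {
    "root": [1.0, 0.0], "a": [0.9, 0.2], "b": [0.2, 0.9],
    "a1": [0.8, 0.3], "a2": [0.9, 0.4],
}
trend = mean_distance_by_hops(parent, feats)
```

If the tree really organizes a coherent semantic space, the per-hop averages should rise with hop count; a flat or noisy trend on real features would be the failure signal described above.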

Figures

Figures reproduced from arXiv: 2606.23679 by Daniel Cohen-Or, Maya Vishnevsky, Omer Dahary, Or Patashnik, Sara Dorfman.

**Figure 1.** Semantic Browsing for Image Generation. From a single text prompt “A poster featuring animals”, the system produces a structured gallery of images that explore different meaningful interpretations of the same scene. Rather than random variations, each image reflects a distinct, coherent semantic choice (e.g., changes in character, composition, or style), allowing users to browse a space of alternatives in a…

**Figure 2.** Diversity Collapse in Standard Sampling. Visual comparison for the prompt: “A clown and a princess holding a wand.” While simply changing the random seed (consecutive seeds 0-3 shown in bottom row) results in repetitive layouts [Dahary et al. 2025] and limited variation, our method (top row) achieves significant structural and semantic diversity.

**Figure 3.** Overview of the iterative generation flow. A user prompt is transformed into a structured JSON format which is iteratively modified by a Multi-Agent workflow. This process creates structured diversity of JSON variations that remain faithful to the initial user intent, driving the generator to produce perceptually distinct images.

**Figure 4.** Example of semantic browsing produced by our method. Starting from an initial scene interpretation inferred from the user prompt, the method explores alternative realizations by committing explicit semantic constraints at each step. Each branching point corresponds to alternative realizations of a single semantic aspect, while previously fixed constraints are preserved. Branching points also include an opt…

**Figure 5.** Multi-Agent workflow guiding an iterative JSON generation process. The pipeline takes the current JSON configuration and a history of constraints derived from previous modifications (including the user prompt) as inputs. A sequence of agents (Context Analyst, Brainstormer, Decision Maker, and Critic) analyzes these inputs to select an aspect to modify and formulate specific instructions. The JSON Refiner the…

**Figure 6.** Example of interactive semantic browsing. At each node, users may either commit to a new realization of the selected semantic aspect and continue refining that interpretation (green), or preserve the current realization and explore other semantic aspects from the same state (orange). All nodes correspond to valid intermediate states that can be further expanded.

**Figure 7.** Structured diversity results. All images shown are derived from a single initial scene. The outer gray groupings organize results that share a direct common ancestor scene. Inside, the colored boxes distinguish sibling branches (parallel variations that share the same parent but differ from one another by a single semantic aspect). This demonstrates how our method introduces meaningful diversity while pres…

**Figure 8.** Qualitative comparison on the prompt: “A glass bowl contains peeled tangerines and cut strawberries.” Columns 2 and 5-7 report results using consecutive seeds with hyperparameters optimized for diversity. Columns 3-4 display the most diverse subset of four images selected from a larger candidate pool. While baseline methods exhibit limited variation, our method (column 1) successfully presents distinct and…

**Figure 9.** Model-Agnostic Generation (FLUX.2). Qualitative results demonstrating the transferability of our framework to the FLUX.2 architecture. By utilizing our agentic flow solely for scene generation and FLUX.2 as the rendering backbone, we achieve consistent structured diversity.

**Figure 10.** Human Preference Study. Our method (Semantic Browsing) dominates in Diversity across all comparisons while consistently outperforming baselines in Overall Preference. […] distance from 0.362 (unified) to 0.389 (separated), representing a 7.2% relative improvement in overall diversity.

**Figure 11.** Semantic-Topological Correlation. Box plot showing the distribution of Pairwise DINO Distances as a function of graph distance (number of edge hops between nodes). The clear upward trend validates that our generation tree creates a coherent semantic space, where topological proximity translates to semantic similarity. […] agent reduces VQAScore from 0.90 to 0.87, while Hierarchical Consistency remains stab…

**Figure 12.** Additional structured diversity results. For each user prompt, outer gray panels group images derived from the same initial scene. Colored boxes distinguish sibling branches (parallel variations that share the same parent but differ from one another by a single semantic aspect).

**Figure 13.** Additional structured diversity results. For each user prompt, outer gray panels group images derived from the same initial scene. Colored boxes distinguish sibling branches (parallel variations that share the same parent but differ from one another by a single semantic aspect).

**Figure 14.** Qualitative comparison on the prompt: “A toilet sits next to a bathtub in an empty bathroom.” Columns 2 and 5-7 report results using consecutive seeds with hyperparameters optimized for diversity. Columns 3-4 display the most diverse subset of four images selected from a larger candidate pool. While baseline methods exhibit limited variation, our method (column 1) successfully presents distinct and cohere…

**Figure 15.** Qualitative comparison on the prompt: “A small train moving along the tracks with a mountain town in the background.” Columns 2 and 5-7 report results using consecutive seeds with hyperparameters optimized for diversity. Columns 3-4 display the most diverse subset of four images selected from a larger candidate pool. While baseline methods exhibit limited variation, our method (column 1) successfully pres…

**Figure 16.** Qualitative comparison on the prompt: “A woman in a red dress standing on top of a lush green field.” Columns 2 and 5-7 report results using consecutive seeds with hyperparameters optimized for diversity. Columns 3-4 display the most diverse subset of four images selected from a larger candidate pool. While baseline methods exhibit limited variation, our method (column 1) successfully presents distinct an…

**Figure 17.** Context Analyst System Prompt. You are a creative planner proposing DIVERSITY TREE branching axes. Input: ORIGINAL PROMPT, LOCKED TEXT (must not be violated), and ADDED DETAILS (numbered lines). Task: propose 3–6 SCENE-SPECIFIC aspect candidates. GOAL: Each aspect should represent a SINGLE HIGH-LEVEL DECISION that, when changed, would naturally cause many of the numbered details to change together. Aspects…

**Figure 18.** Brainstormer System Prompt.

**Figure 19.** Decision Maker System Prompt. You are a quality-control CRITIC for a diversity-tree image editing pipeline. Your job is to revise the chooser's edit instructions so they are (1) are PROMPT-ADHERENT and (2) constraint-safe and (3) strong, aspect-faithful edits. WHAT YOU ARE GIVEN: ORIGINAL USER PROMPT: the user's intent (highest priority - must be preserved). ACCUMULATED CONSTRAINTS: hard requirements colle…
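
As a quick check on the numbers quoted in the Figure 10 text, the relative change between the two distances can be recomputed. This is only arithmetic on the rounded values 0.362 and 0.389 as they appear in the excerpt; the unrounded values are not available here.

```latex
\[
\frac{0.389 - 0.362}{0.362} = \frac{0.027}{0.362} \approx 0.0746
\]
```

The rounded endpoints give roughly 7.5%, slightly above the quoted 7.2%. The gap is within what rounding of the two distances to three decimals could produce, though the excerpt does not confirm how the percentage was computed.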

Original abstract

Modern text-to-image models excel in visual fidelity and prompt adherence. However, this strict adherence comes at the cost of diversity: generated samples tend to collapse into a single visual interpretation. Existing methods to improve diversity produce outputs driven by incidental variations rather than meaningful design choices. This motivates a new variant of the diversity task where structure is enforced on the generated samples. We introduce a method for controlled diversity that enables Semantic Browsing, where users can navigate structured image galleries and experience creative exploration through a systematic traversal of meaningful, interpretable axes of variation. Achieving this level of semantic control requires a deep understanding of the scene. We exploit the fact that recent text-to-image models are trained on elaborated captions, effectively decoupling semantic decision-making from pixel generation. This enables a paradigm shift: instead of relying on stochastic variation within the text-to-image model, we induce diversity directly at the text level. By leveraging rich textual representations, we allow a Vision Language Model (VLM) to operate on the full scene context. To overcome the generic outputs typical of standard VLMs, we employ an agentic workflow that explicitly enforces structured variation attuned to the original prompt. We demonstrate that our method produces diverse and navigable design spaces where every variation corresponds to a specific, user-understandable semantic decision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's agentic VLM workflow for text-level structured diversity is a clear framing shift, but the decoupling premise stays untested and the abstract supplies no results to back the navigability claim.

The main thing here is an attempt to move diversity control upstream into prompt editing via an agentic VLM loop, so that each generated image sits on an explicit semantic axis rather than arising from incidental model noise.

This is new relative to the usual seed-variation or latent-space tricks. The authors correctly note that most existing diversity methods do not produce galleries a user can traverse by meaningful decisions, and they try to enforce that structure at the caption level.

The approach does a decent job of naming the practical problem for design workflows. Treating rich captions as modular instructions is a reasonable hypothesis given how current T2I models are trained.

The soft spot is that nothing in the description shows the hypothesis holds. The abstract states the decoupling as fact but gives no prompt-ablation results, attribution checks, or even sample galleries to confirm that editing one semantic clause changes only the intended visual element, rather than producing correlated side effects or being ignored by the generator. Without those measurements the method reduces to careful prompt engineering whose outputs may still be incidental.

The paper is aimed at people building controllable interfaces on top of existing generators. A reader who needs a concrete way to organize variation axes would get some value from the workflow description, even if the validation is missing.

It deserves a serious referee. The idea is distinct enough from prior diversity work that checking the implementation details and any quantitative or user-study evidence is worth the time.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Semantic Browsing, a method for controlled diversity in text-to-image generation. It claims that recent T2I models trained on elaborated captions decouple semantic decision-making from pixel generation, allowing an agentic VLM workflow to induce structured textual variations. This produces navigable design spaces in which every output variation maps to a specific, user-understandable semantic decision rather than incidental stochastic changes.

Significance. If empirically validated, the approach could meaningfully advance controllable generation by shifting diversity induction from pixel-level noise to interpretable text-level axes, enabling more purposeful creative exploration tools. The reliance on existing models without retraining is a practical strength, and the emphasis on user-understandable semantics addresses a real usability gap in current generative systems.

major comments (2)

[Abstract] Abstract (paradigm shift paragraph): the central premise that elaborated captions 'effectively decouple semantic decision-making from pixel generation' is asserted without any supporting measurement. No prompt-ablation studies, attribution analysis, or sensitivity tests are referenced to demonstrate that text edits produce isolated, interpretable image changes rather than correlated or ignored variations; this assumption is load-bearing for the claimed paradigm shift.
[Abstract] Abstract (final sentence): the claim that the method 'produces diverse and navigable design spaces where every variation corresponds to a specific, user-understandable semantic decision' is presented as a demonstrated result, yet the provided text contains no quantitative metrics, qualitative examples, failure-case analysis, or comparison to baseline prompt-engineering methods that would allow evaluation of whether the agentic workflow avoids collapse or hallucination.

minor comments (1)

The abstract would be strengthened by naming the specific T2I and VLM models employed and by indicating the scale of any user studies or automated evaluations performed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and indicate planned revisions to the abstract.

Referee: [Abstract] Abstract (paradigm shift paragraph): the central premise that elaborated captions 'effectively decouple semantic decision-making from pixel generation' is asserted without any supporting measurement. No prompt-ablation studies, attribution analysis, or sensitivity tests are referenced to demonstrate that text edits produce isolated, interpretable image changes rather than correlated or ignored variations; this assumption is load-bearing for the claimed paradigm shift.

Authors: We agree the abstract states the decoupling premise without referencing supporting measurements inside the abstract itself. The full manuscript contains experiments and analysis showing the impact of targeted text edits on image outputs. We will revise the abstract to add a brief reference to the relevant experimental sections that provide this supporting evidence. Revision: yes.
Referee: [Abstract] Abstract (final sentence): the claim that the method 'produces diverse and navigable design spaces where every variation corresponds to a specific, user-understandable semantic decision' is presented as a demonstrated result, yet the provided text contains no quantitative metrics, qualitative examples, failure-case analysis, or comparison to baseline prompt-engineering methods that would allow evaluation of whether the agentic workflow avoids collapse or hallucination.

Authors: The abstract is a high-level summary; the manuscript body supplies the qualitative examples, baseline comparisons, and workflow analysis. We will revise the abstract to qualify the claim by noting that the supporting demonstrations appear in the main text, avoiding any implication that the abstract alone contains the full evaluation. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper describes a methodological workflow for semantic browsing via agentic VLM on top of existing T2I models. It states an external premise about elaborated captions decoupling semantics from pixels but does not derive this premise from its own inputs, equations, or self-citations. No fitted parameters, predictions, uniqueness theorems, or ansatzes are introduced that reduce to the paper's own definitions or prior self-work by construction. The approach is presented as leveraging off-the-shelf models without any load-bearing self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central approach rests on one key domain assumption about model training data; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Recent text-to-image models trained on elaborated captions effectively decouple semantic decision-making from pixel generation.
Invoked to justify inducing diversity at the text level rather than within the image model.

Reference graph

Works this paper leans on

227 extracted references · 14 canonical work pages

[1]

arXiv preprint arXiv:2310.17347 , year=

CADS: Unleashing the diversity of diffusion models through condition-annealed sampling , author=. arXiv preprint arXiv:2310.17347 , year=

arXiv
[2]

arXiv preprint arXiv:2310.13102 , year=

Particle guidance: non-iid diverse sampling with diffusion models , author=. arXiv preprint arXiv:2310.13102 , year=

arXiv
[3]

Advances in Neural Information Processing Systems , volume=

Applying guidance in a limited interval improves sample and distribution quality in diffusion models , author=. Advances in Neural Information Processing Systems , volume=
[4]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Minority-Focused Text-to-Image Generation via Prompt Optimization , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
[5]

arXiv preprint arXiv:2508.15773 , year=

Scaling Group Inference for Diverse and High-Quality Generation , author=. arXiv preprint arXiv:2508.15773 , year=

arXiv
[6]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Learning to sample effective and diverse prompts for text-to-image generation , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
[7]

arXiv preprint arXiv:2509.10704 , year=

Maestro: Self-improving text-to-image generation via agent orchestration , author=. arXiv preprint arXiv:2509.10704 , year=

arXiv
[8]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages=

Promptsculptor: Multi-agent based text-to-image prompt optimization , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages=

2025
[9]

arXiv preprint arXiv:2412.06771 , year=

Proactive agents for multi-turn text-to-image generation under uncertainty , author=. arXiv preprint arXiv:2412.06771 , year=

arXiv
[10]

arXiv preprint arXiv:2207.12598 , year=

Classifier-free diffusion guidance , author=. arXiv preprint arXiv:2207.12598 , year=

Pith/arXiv arXiv
[11]

FirstName LastName , title =
[12]

FirstName Alpher , title =
[13]

Journal of Foo , volume = 13, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe , title =. Journal of Foo , volume = 13, number = 1, pages =
[14]

Journal of Foo , volume = 14, number = 1, pages =

FirstName Alpher and FirstName Fotheringham-Smythe and FirstName Gamow , title =. Journal of Foo , volume = 14, number = 1, pages =
[15]

FirstName Alpher and FirstName Gamow , title =
[16]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
[17]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
[18]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

2016
[19]

2023 , eprint=

Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. 2023 , eprint=

2023
[20]

2025 , eprint=

Learning Multi-Level Features with Matryoshka Sparse Autoencoders , author=. 2025 , eprint=

2025
[21]

2025 , eprint=

Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models , author=. 2025 , eprint=

2025
[22]

2020 , eprint=

Denoising Diffusion Probabilistic Models , author=. 2020 , eprint=

2020
[23]

2022 , eprint=

High-Resolution Image Synthesis with Latent Diffusion Models , author=. 2022 , eprint=

2022
[24]

2023 , eprint=

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis , author=. 2023 , eprint=

2023
[25]

2022 , eprint=

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding , author=. 2022 , eprint=

2022
[26]

2022 , eprint=

Hierarchical Text-Conditional Image Generation with CLIP Latents , author=. 2022 , eprint=

2022
[27]

2022 , eprint=

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations , author=. 2022 , eprint=

2022
[28]

2022 , eprint=

Prompt-to-Prompt Image Editing with Cross Attention Control , author=. 2022 , eprint=

2022
[29]

2023 , eprint=

InstructPix2Pix: Learning to Follow Image Editing Instructions , author=. 2023 , eprint=

2023
[30]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year =

Patashnik, Or and Garibi, Daniel and Azuri, Idan and Averbuch-Elor, Hadar and Cohen-Or, Daniel , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year =
[31]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Blended diffusion for text-driven editing of natural images , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[32]

arXiv preprint arXiv:2206.02779 , year=

Blended Latent Diffusion , author=. arXiv preprint arXiv:2206.02779 , year=

arXiv
[33]

ArXiv , year=

DiffEdit: Diffusion-based semantic image editing with mask guidance , author=. ArXiv , year=
[34]

2022 , eprint=

Denoising Diffusion Implicit Models , author=. 2022 , eprint=

2022
[35]

2022 , eprint=

Null-text Inversion for Editing Real Images using Guided Diffusion Models , author=. 2022 , eprint=

2022
[36]

2023 , eprint=

Negative-prompt Inversion: Fast Image Inversion for Editing with Text-guided Diffusion Models , author=. 2023 , eprint=

2023
[37]

2023 , eprint=

Improving Tuning-Free Real Image Editing with Proximal Guidance , author=. 2023 , eprint=

2023
[38]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Tumanyan, Narek and Geyer, Michal and Bagon, Shai and Dekel, Tali , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2023 , pages =

2023
[39]

Conference on Computer Vision and Pattern Recognition 2023 , year=

Imagic: Text-Based Real Image Editing with Diffusion Models , author=. Conference on Computer Vision and Pattern Recognition 2023 , year=

2023
[40]

arXiv preprint arXiv:2304.08465 , year=

MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing , author=. arXiv preprint arXiv:2304.08465 , year=

arXiv
[41]

Proceedings of the IEEE/CVF International Conference on Computer Vision , year=

Expressive text-to-image generation with rich text , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , year=
[42]

2023 , eprint=

Cross-Image Attention for Zero-Shot Appearance Transfer , author=. 2023 , eprint=

2023
[43]

2024 , eprint=

ReNoise: Real Image Inversion Through Iterative Noising , author=. 2024 , eprint=

2024
[44]

2024 , eprint=

TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models , author=. 2024 , eprint=

2024
[45]

2023 , eprint=

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference , author=. 2023 , eprint=

2023
[46]

2023 , eprint=

Adversarial Diffusion Distillation , author=. 2023 , eprint=

2023
[47]

arXiv preprint arXiv:2311.05556 , year=

LCM-LoRA: A Universal Stable-Diffusion Acceleration Module , author=. arXiv preprint arXiv:2311.05556 , year=

arXiv
[48]

2023 , eprint=

Consistency Models , author=. 2023 , eprint=

2023
[49]

Bermano, Amit , title =

Arar, Moab and Gal, Rinon and Atzmon, Yuval and Chechik, Gal and Cohen-Or, Daniel and Shamir, Ariel and H. Bermano, Amit , title =. 2023 , isbn =. doi:10.1145/3610548.3618173 , booktitle =

work page doi:10.1145/3610548.3618173 2023
[50]

SIGGRAPH Asia 2024 Conference Papers , pages=

Moa: Mixture-of-attention for subject-context disentanglement in personalized image generation , author=. SIGGRAPH Asia 2024 Conference Papers , pages=

2024
[51]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Ominicontrol: Minimal and universal control for diffusion transformer , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[52]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[53]

2022 , eprint=

Classifier-Free Diffusion Guidance , author=. 2022 , eprint=

2022
[54]

International Conference on Learning Representations , year=

Score-Based Generative Modeling through Stochastic Differential Equations , author=. International Conference on Learning Representations , year=
[55]

Advances in Neural Information Processing Systems , volume=

Elucidating the design space of diffusion-based generative models , author=. Advances in Neural Information Processing Systems , volume=
[56]

2021 , eprint=

Diffusion Models Beat GANs on Image Synthesis , author=. 2021 , eprint=

2021
[57]

arXiv preprint arxiv:2312.02133 , year=

Style Aligned Image Generation via Shared Attention , author=. arXiv preprint arxiv:2312.02133 , year=

arXiv
[58]

Advances in Neural Information Processing Systems , year=

Diffusion Self-Guidance for Controllable Image Generation , author=. Advances in Neural Information Processing Systems , year=
[59]

2023 , eprint=

SEGA: Instructing Text-to-Image Models using Semantic Guidance , author=. 2023 , eprint=

2023
[60]

URLhttp://dx.doi.org/10.1145/3588432.3591513

Parmar, Gaurav and Kumar Singh, Krishna and Zhang, Richard and Li, Yijun and Lu, Jingwan and Zhu, Jun-Yan , year=. Zero-shot Image-to-Image Translation , url=. doi:10.1145/3588432.3591513 , booktitle=

work page doi:10.1145/3588432.3591513
[61]

arXiv preprint arxiv:2311.17609 , year=

AnyLens: A Generative Diffusion Model with Any Rendering Lens , author=. arXiv preprint arxiv:2311.17609 , year=

arXiv
[62]

2023 , eprint=

An Edit Friendly DDPM Noise Space: Inversion and Manipulations , author=. 2023 , eprint=

2023
[63]

Proceedings of European Conference on Computer Vision (ECCV) , year=

Generative Visual Manipulation on the Natural Image Manifold , author=. Proceedings of European Conference on Computer Vision (ECCV) , year=
[64]

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Richardson, Elad and Alaluf, Yuval and Patashnik, Or and Nitzan, Yotam and Azar, Yaniv and Shapiro, Stav and Cohen-Or, Daniel , title =. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =
[65]

2021 , issue_date =

Tov, Omer and Alaluf, Yuval and Nitzan, Yotam and Patashnik, Or and Cohen-Or, Daniel , title =. 2021 , issue_date =. doi:10.1145/3450626.3459838 , journal =

work page doi:10.1145/3450626.3459838 2021
[66]

Understanding and

Alaluf, Yuval and Patashnik, Or and Cohen-Or, Daniel , year=. ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement , url=. doi:10.1109/iccv48922.2021.00664 , booktitle=

work page doi:10.1109/iccv48922.2021.00664 2021
[67]

and Cohen-Or, Daniel , year=

Roich, Daniel and Mokady, Ron and Bermano, Amit H. and Cohen-Or, Daniel , year=. Pivotal Tuning for Latent-based Editing of Real Images , volume=. ACM Transactions on Graphics , publisher=. doi:10.1145/3544777 , number=

work page doi:10.1145/3544777
[68]

A Style-Based Generator Architecture for Generative Adversarial Networks , isbn =

Abdal, Rameen and Qin, Yipeng and Wonka, Peter , year=. Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space? , url=. doi:10.1109/iccv.2019.00453 , booktitle=

work page doi:10.1109/iccv.2019.00453 2019
[69] Abdal, Rameen and Qin, Yipeng and Wonka, Peter. Image2StyleGAN++: How to Edit the Embedded Images? 2020. doi:10.1109/cvpr42600.2020.00832
[70] Improved StyleGAN Embedding: Where are the Good Latents? 2020.
[71] In-domain GAN Inversion for Real Image Editing. Proceedings of European Conference on Computer Vision (ECCV).
[72] Parmar, Gaurav and Li, Yijun and Lu, Jingwan and Zhang, Richard and Zhu, Jun-Yan and Singh, Krishna Kumar. Spatially-Adaptive Multilayer Selection for GAN Inversion and Editing. 2022. doi:10.1109/cvpr52688.2022.01111
[73] HyperInverter: Improving StyleGAN Inversion via Hypernetwork. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[74] Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance. arXiv preprint arXiv:2210.05559.
[75] Progressive Distillation for Fast Sampling of Diffusion Models. International Conference on Learning Representations.
[76] The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. CVPR.
[77] BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
[78] Understanding DDPM Latent Codes Through Optimal Transport. The Eleventh International Conference on Learning Representations.
[79] Tsung. Microsoft. CoRR, 2014. arXiv:1405.0312.
[80] EDICT: Exact Diffusion Inversion via Coupled Transformations. arXiv preprint arXiv:2211.12446.
