Controlla: Learning Controllability via Graph-Constrained Latent Geometry

Amin Karimi Monsefi; Jamuna S. Murthy; Rajiv Ramnath

arxiv: 2605.16603 · v1 · pith:LHZARPLUnew · submitted 2026-05-15 · 💻 cs.CV

Pith reviewed 2026-05-20 18:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords: controllable multimodal generation · latent geometry · graph priors · optimal transport · identity preservation · affective control · trajectory consistency

The pith

Controlla structures latent geometry with graph priors so attributes evolve along consistent paths while identity stays fixed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that controllability in multimodal generation can be achieved by learning identity and attribute factors and aligning them to graph priors through graph-constrained optimal transport. This approach encourages attributes to change in ways that respect predefined semantic relationships across different input types. A sympathetic reader would care because current conditioning methods often cause identity to drift or produce inconsistent behaviors when switching between modalities. By making the latent space follow graph-consistent trajectories, the method aims to produce more reliable control without needing extra guidance at inference time. The authors introduce a new benchmark called AffectHuman-43K to test reference-grounded affective control and metrics that check trajectory consistency and disentanglement.

Core claim

Controlla is a modular factorized-control framework that treats controllability as a property of structured latent geometry. It learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport. This encourages attributes to follow graph-consistent trajectories while preserving reference identity. Experiments on the AffectHuman-43K benchmark demonstrate improvements in controllability, identity preservation, and cross-modal alignment.

What carries the argument

Graph-constrained optimal transport, which aligns learned identity and attribute factors with graph priors to enforce consistent trajectories in latent space.
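
As a rough illustration of the mechanism, the sketch below runs entropic optimal transport (Sinkhorn iterations) between a handful of attribute latents and graph-node anchors. This summary does not give the paper's cost function or constraint, so everything specific here is an assumption: the cost (squared Euclidean distance plus `lam_g` times the hop distance from each sample's labeled node), the toy three-node chain, and all function names.

```python
import math

def hop_distances(adj):
    """All-pairs hop counts on an unweighted graph, via BFS from every node."""
    n = len(adj)
    dist = [[math.inf] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        frontier = [s]
        while frontier:
            nxt = []
            for u in frontier:
                for v in range(n):
                    if adj[u][v] and dist[s][v] == math.inf:
                        dist[s][v] = dist[s][u] + 1
                        nxt.append(v)
            frontier = nxt
    return dist

def graph_cost(latents, anchors, labels, hops, lam_g=1.0):
    """Assumed cost: squared distance to each anchor plus lam_g * hops from the labeled node."""
    cost = []
    for z, y in zip(latents, labels):
        row = []
        for k, c in enumerate(anchors):
            sq = sum((zi - ci) ** 2 for zi, ci in zip(z, c))
            row.append(sq + lam_g * hops[y][k])
        cost.append(row)
    return cost

def sinkhorn(cost, a, b, eps=0.5, iters=200):
    """Entropic OT: alternately rescale rows and columns of exp(-cost / eps)."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy chain graph with three nodes, 0 - 1 - 2 (illustrative only).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
anchors = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]      # hypothetical node anchors
latents = [[0.1, 0.0], [0.9, 0.1], [2.1, -0.1]]     # hypothetical attribute latents
labels = [0, 1, 2]                                  # node each latent is labeled with
hops = hop_distances(adj)
plan = sinkhorn(graph_cost(latents, anchors, labels, hops), [1 / 3] * 3, [1 / 3] * 3)
```

With uniform marginals, the plan puts most of each sample's mass on its own node. Raising `lam_g` should shift whatever mass remains toward graph neighbors and away from distant nodes.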

If this is right

Attributes follow graph-consistent trajectories across modalities.
Identity is preserved better during control operations.
Cross-modal alignment improves in generated outputs.
The framework shows robustness and extensibility according to the reported analyses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If similar graph priors can be defined for other attribute sets, the alignment step might transfer to tasks such as expression control in video or pose manipulation in 3D.
The explicit separation of identity and attribute factors could support modular systems where one factor is held constant while another is adjusted independently.
Geometry-aware evaluation metrics could be applied to test disentanglement in other generative models that operate on multimodal inputs.
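
A minimal sketch of the second point above, under stated assumptions: the identity vector is passed through untouched while the attribute vector moves edge by edge along the shortest path in an attribute graph. The graph, the node names, the 2-D anchors, and `graph_traversal` are hypothetical and not taken from the paper.

```python
from collections import deque

def shortest_path(adj, src, dst):
    """BFS shortest path on an unweighted graph given as an adjacency dict."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    if dst not in prev:
        raise ValueError("no path between nodes")
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def lerp(p, q, t):
    """Linear interpolation between two vectors."""
    return [pi + t * (qi - pi) for pi, qi in zip(p, q)]

def graph_traversal(identity, anchors, adj, src, dst, steps_per_edge=2):
    """Return the node path and a list of (identity, attribute) pairs along it."""
    path = shortest_path(adj, src, dst)
    out = [(identity, anchors[path[0]])]
    for a, b in zip(path, path[1:]):
        for s in range(1, steps_per_edge + 1):
            out.append((identity, lerp(anchors[a], anchors[b], s / steps_per_edge)))
    return path, out

# Hypothetical attribute graph and anchors.
adj = {"sad": ["neutral"], "neutral": ["sad", "happy"],
       "happy": ["neutral", "excited"], "excited": ["happy"]}
anchors = {"sad": [-1.0, 0.0], "neutral": [0.0, 0.0],
           "happy": [1.0, 1.0], "excited": [1.0, 2.0]}
identity = [0.3, 0.7, 0.1]  # held constant for every step

path, steps = graph_traversal(identity, anchors, adj, "sad", "excited")
# Straight-line baseline: passes through points that are not graph nodes.
linear = [lerp(anchors["sad"], anchors["excited"], t / 6) for t in range(7)]
```

The graph route visits `neutral` and `happy` on the way, whereas the straight line from `sad` to `excited` passes through neither of them.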

Load-bearing premise

The graph priors correctly encode the semantic relationships and trajectories among attributes across modalities.

What would settle it

An observation that attribute trajectories in the latent space fail to match the graph structure or that identity preservation metrics show no gain over baseline conditioning methods on the AffectHuman-43K benchmark would disprove the claim.
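
One way such a check could be run, sketched with hypothetical metrics rather than the paper's own: `edge_violation_rate` counts how often a latent trajectory jumps between nodes that the graph does not connect, and `cosine` compares a reference identity embedding with a generated one. Anchors, graph, and vectors are toy values.

```python
import math

def nearest_node(z, anchors):
    """Name of the anchor closest to latent point z (squared Euclidean distance)."""
    return min(anchors, key=lambda k: sum((zi - ci) ** 2 for zi, ci in zip(z, anchors[k])))

def edge_violation_rate(trajectory, anchors, adj):
    """Fraction of node-to-node moves along the trajectory that are not graph edges."""
    nodes = [nearest_node(z, anchors) for z in trajectory]
    moves = [(u, v) for u, v in zip(nodes, nodes[1:]) if u != v]
    if not moves:
        return 0.0
    return sum(1 for u, v in moves if v not in adj[u]) / len(moves)

def cosine(p, q):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

# Toy chain graph A - B - C with anchors on a line.
anchors = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [2.0, 0.0]}
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}

stepwise = [[0.0, 0.0], [0.6, 0.0], [1.4, 0.0], [2.0, 0.0]]  # moves A -> B -> C
jumping = [[0.0, 0.0], [2.0, 0.0]]                           # moves A -> C directly

stepwise_rate = edge_violation_rate(stepwise, anchors, adj)
jumping_rate = edge_violation_rate(jumping, anchors, adj)
identity_sim = cosine([0.6, 0.8], [0.6, 0.8])  # identical embeddings
```

A violation rate that stays high under the trained model, or an identity similarity no better than a baseline's, would be the kind of evidence described above.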

Figures

Figures reproduced from arXiv: 2605.16603 by Amin Karimi Monsefi, Jamuna S. Murthy, Rajiv Ramnath.

**Figure 1.** Controllability as structured latent geometry. Controlla maps image, reference image, text, and audio into factorized identity and attribute spaces. Attribute factors follow graph-consistent semantic traversal, while the reference-grounded identity factor is held stable.

**Figure 2.** Controlla framework. Multimodal inputs (image, reference image, text, audio) are encoded into a shared representation and factorized into attribute and identity components. Emotion and identity graph priors are aligned with these factors using graph-constrained optimal transport, enabling graph-consistent attribute traversal while preserving reference identity.

**Figure 3.** Effect of graph strength. Increasing λ_g improves controllability and human preference while reducing GC, with CLIP reported only as an auxiliary alignment diagnostic.

**Figure 4.** Graph-consistent vs. linear traversal. Graph-consistent traversal yields smoother, identity-preserving transitions and lower GC than linear interpolation.

**Figure 5.** Cross-dataset qualitative comparison. Generated examples show identity preservation, expression consistency, and semantic alignment across datasets and methods under matched inputs.

**Figure 6.** Graph-guided latent control. Structured latent geometry yields smooth emotion transitions with consistent identity, unlike linear interpolation. The formulation can support alternative graph-defined control factors beyond affective control, such as pose, expression intensity, or style; see Appendix Sec. L and Sec. M.

**Figure 7.** Emotion distribution across 8 classes. AffectHuman-43K provides broad affective coverage with mild natural skew across emotion categories.

Adjacent paper text (Why singleton groups matter): Singleton groups prevent the benchmark from being dominated by repeated identities. They also test whether a model can preserve identity from a single reference image, which is a common real-world use case for reference-guided editing.

**Figure 8.** Preference boxplots. Scores compare Controlla with baselines across evaluation criteria.

**Figure 9.** Mean preference heatmap. Average scores summarize Controlla’s gains across evaluation dimensions.

Adjacent paper text (E.5 Results and Analysis): Across evaluation dimensions, Controlla receives consistently positive preference scores.

**Figure 10.** Preference-score distributions. Violin plots show response variability across criteria.

**Figure 11.** Ranked mean preferences. Criteria are ordered by average preference score.

**Figure 12.** Demographic preference distributions. Positive scores remain stable across participant groups.

Adjacent paper text (F.1 Controlla Architecture): Controlla is implemented as a modular factorized-control stack around a pretrained diffusion generator. The base Stable Diffusion v1.5 generator is kept frozen unless otherwise specified, while the multimodal adapters, factorization heads, graph-OT alignment module, and generator-cond…

**Figure 13.** Cross-dataset comparison across AffectHuman-43K, CelebA-HQ, and AffectNet.

**Figure 14.** Comparison across representative baseline methods.

**Figure 15.** Multi-identity evaluation under shared target conditions.

Adjacent paper text (Multimodal ambiguity): Figures 19 and 20 examine scenarios with potentially conflicting or ambiguous multimodal inputs. These cases test how models integrate multiple signals when cues are not fully aligned. Across methods, outputs may vary in how different signals are prioritized. Controlla produces more consistent outputs across the presented exam…

**Figure 16.** Fine-grained variations within high-level emotion categories.

Adjacent paper text: This behavior reflects how graph constraints shape the latent space and influence the trade-off between controllability and diversity. J.3 Behavior Under Multimodal Ambiguity: Controlla is designed for settings in which image, text, and audio provide complementary affective cues, but in practice these signals may be partially ambiguous or weakl…

**Figure 17.** Challenging cases involving visually similar expressions.

**Figure 18.** Challenging cases with subtle semantic differences.

**Figure 19.** Multimodal ambiguity with partially conflicting cues.

**Figure 20.** Multimodal ambiguity under diverse identity and expression settings.

Adjacent paper text: This behavior suggests that Controlla integrates multimodal signals through the learned attribute factor rather than relying on a single dominant cue. However, the model does not include an explicit conflict-resolution module. Therefore, when modalities provide strongly contradictory affective instructions, the output may reflect a compr…

**Figure 21.** Behavior under conflicting multimodal signals. Inputs include a smiling image, calm audio, and strongly negative text. The resulting outputs reflect a combination of these cues, illustrating how structured priors guide latent transitions when signals are not fully aligned.

Adjacent paper text: … transport structure can be cached or avoided, since generation only requires the learned factorization and conditioning adapters.

**Figure 22.** Effect of graph regularization strength. Varying λ changes the balance between semantic consistency and expressive variability. Lower values produce more diverse outputs, while higher values yield more consistent but less varied expressions.

**Figure 23.** Multimodal ambiguity. Partially conflicting image, text, and audio cues produce outputs that combine or prioritize affective evidence across modalities.

Adjacent paper text (K Extensibility Beyond Affective Control, K.1 Framework Generality): Controlla is formulated as a graph-conditioned latent control framework, rather than as an emotion-specific architecture. The method requires two abstract ingredients: (i) an attribute grap…

**Figure 24.** Pose control. A plug-in pose graph enables ordered pose traversal while preserving reference identity.

**Figure 25.** Lighting control. A plug-in lighting graph enables structured illumination changes while preserving identity and content.

Original abstract

Controllable multimodal generation is commonly formulated as an inference-time conditioning problem using prompts, guidance, or auxiliary modules. While effective, such approaches do not explicitly structure how semantic attributes evolve, which can lead to identity drift and inconsistent cross-modal behavior. We propose Controlla, a modular factorized-control framework that treats controllability as a property of structured latent geometry. Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport, encouraging attributes to follow graph-consistent trajectories while preserving reference identity. To evaluate this setting, we construct AffectHuman-43K, a leakage-aware multimodal benchmark for reference-grounded affective control, and introduce geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments show consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses on graph sensitivity, extensibility, and robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Controlla structures latent spaces with factorized factors and graph-constrained OT to reduce identity drift in multimodal generation, but the abstract gives no numbers and the graph priors need validation.

The main point is that Controlla treats controllability as a property of the latent geometry itself rather than an add-on at inference. It factorizes identity and attribute representations from multimodal inputs, then uses graph-constrained optimal transport to push attributes along consistent trajectories while holding identity fixed. This directly targets the drift and inconsistency problems that show up with prompts or guidance alone.

The paper introduces AffectHuman-43K as a leakage-aware benchmark for reference-grounded affective control and defines geometry-aware metrics for trajectory consistency and disentanglement. Those are useful additions that give concrete ways to measure the claims. The modular framing also separates the factor learning from the alignment step, which makes the approach easier to inspect or extend.

The soft spots sit in the evidence. The abstract states that experiments show consistent improvements in controllability, identity preservation, and cross-modal alignment, yet supplies no quantitative results, ablation tables, or error analysis. Without those details it is hard to judge whether the gains come from the graph mechanism or from fitting to this particular benchmark. The graph priors themselves are not described in enough detail to know if they are hand-crafted, data-derived, or externally validated against real semantic relationships. If the priors are benchmark-specific, the trajectory consistency may not hold more broadly.

This paper is for researchers working on controllable multimodal models who want to move beyond inference-time conditioning. A reader interested in latent space structure and evaluation metrics would find the framework and the new dataset worth examining.

I would send it to peer review. The core idea is distinct enough from routine guidance methods to merit a full check, even if the authors need to add concrete results and graph justification.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Controlla, a modular factorized-control framework for controllable multimodal generation. It learns identity and attribute factors from multimodal inputs and aligns them with graph priors via graph-constrained optimal transport, with the goal of encouraging attributes to follow graph-consistent trajectories while preserving reference identity. The authors introduce the leakage-aware AffectHuman-43K benchmark for reference-grounded affective control along with geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments are reported to show consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses on graph sensitivity and robustness.

Significance. If the central empirical claims hold and the graph priors are demonstrated to capture verifiable semantic structure rather than benchmark-specific artifacts, the work could meaningfully advance controllable generation by treating controllability as an intrinsic property of structured latent geometry instead of relying solely on inference-time conditioning. The introduction of a new multimodal affective benchmark and geometry-aware evaluation metrics constitutes a concrete contribution to the field.

major comments (2)

Section 3.2 (Graph priors and OT alignment): The construction and validation of the graph priors for AffectHuman-43K are described at a high level only. It remains unclear whether the graphs are hand-crafted, derived from data statistics, or externally validated against cross-modal semantic trajectories (e.g., affective state transitions). This detail is load-bearing for the central claim that the method produces graph-consistent trajectories without identity drift; absent such validation, reported gains risk being attributable to fitting the specific benchmark rather than the latent-geometry mechanism.
Section 5 (Experiments and ablations): The manuscript reports consistent improvements but provides insufficient quantitative detail on effect sizes, standard deviations, or ablations isolating the graph-constrained OT term versus the factorized representation alone. Without these, it is difficult to assess whether the geometry-aware metrics genuinely support the controllability claims or whether gains could arise from other modeling choices.
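
The reporting asked for in the second comment could look like the following sketch, which uses Python's `statistics` module to produce a mean, a sample standard deviation, and Cohen's d with a pooled standard deviation. The score lists are placeholders chosen for easy arithmetic, not results from the paper, and the variable names are illustrative.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)

def cohens_d(treatment, control):
    """Effect size: mean difference divided by the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    v1, v2 = statistics.variance(treatment), statistics.variance(control)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled

# Placeholder per-run scores (NOT from the paper): one list per ablation variant.
with_graph_ot = [2.0, 3.0, 4.0]       # hypothetical: factorization + graph-OT term
factorization_only = [1.0, 2.0, 3.0]  # hypothetical: factorization alone

mean_full, std_full = summarize(with_graph_ot)
mean_ablate, std_ablate = summarize(factorization_only)
effect = cohens_d(with_graph_ot, factorization_only)
```

Reporting the two summaries side by side with the effect size would separate the contribution of the OT term from that of the factorized representation.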

minor comments (2)

Abstract: The phrase 'consistent improvements' would be strengthened by including one or two key numerical results (e.g., percentage gains on controllability or identity metrics) to give readers an immediate sense of effect magnitude.
Notation (Section 3): The distinction between the identity factor and attribute factor embeddings could be clarified with an explicit equation or diagram early in Section 3 to avoid ambiguity when discussing the OT alignment step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. The comments identify important areas where additional detail will strengthen the manuscript's support for the central claims regarding graph-constrained latent geometry. We address each major comment below and commit to the indicated revisions.

Referee: Section 3.2 (Graph priors and OT alignment): The construction and validation of the graph priors for AffectHuman-43K are described at a high level only. It remains unclear whether the graphs are hand-crafted, derived from data statistics, or externally validated against cross-modal semantic trajectories (e.g., affective state transitions). This detail is load-bearing for the central claim that the method produces graph-consistent trajectories without identity drift; absent such validation, reported gains risk being attributable to fitting the specific benchmark rather than the latent-geometry mechanism.

Authors: We agree that the current description in Section 3.2 is high-level and requires expansion to substantiate the core claim. The graph priors are constructed in a data-driven manner by deriving transition probabilities from co-occurrence statistics of affective attribute labels across the multimodal samples in AffectHuman-43K. We will revise Section 3.2 to include the precise construction procedure (including the adjacency matrix computation and edge weighting), along with any steps taken to align the resulting graph with established affective transition patterns. This revision will clarify the distinction between benchmark-specific fitting and the intended latent-geometry mechanism.

Revision: yes

Referee: Section 5 (Experiments and ablations): The manuscript reports consistent improvements but provides insufficient quantitative detail on effect sizes, standard deviations, or ablations isolating the graph-constrained OT term versus the factorized representation alone. Without these, it is difficult to assess whether the geometry-aware metrics genuinely support the controllability claims or whether gains could arise from other modeling choices.

Authors: We acknowledge that the experimental reporting in Section 5 would benefit from greater quantitative rigor. While the manuscript already contains ablation variants that remove the OT alignment component, we agree that reporting effect sizes, standard deviations over multiple runs, and a more isolated comparison of the graph-constrained OT term versus the factorized representation alone would better isolate the contribution of the geometry mechanism. We will expand Section 5 with these details, including tables that report means and standard deviations for the key geometry-aware metrics and a dedicated ablation isolating the OT term.

Revision: yes
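
To illustrate the first response, here is one way co-occurrence counts could be turned into a weighted adjacency and row-normalized transition probabilities. This is a sketch of the general technique and not the authors' procedure, which the text above does not specify; the `cooccurrence_graph` helper, the `min_count` threshold, and the toy label sets are assumptions.

```python
from collections import Counter

def cooccurrence_graph(samples, min_count=1):
    """Build a weighted adjacency and row-normalized transitions from label sets.

    samples: one collection of attribute labels per multimodal sample,
             e.g. the labels attached to its image, text, and audio.
    """
    labels = sorted({label for sample in samples for label in sample})
    counts = Counter()
    for sample in samples:
        uniq = sorted(set(sample))
        for i, a in enumerate(uniq):
            for b in uniq[i + 1:]:
                counts[(a, b)] += 1  # symmetric co-occurrence count
                counts[(b, a)] += 1
    adjacency = {
        a: {b: counts[(a, b)] for b in labels
            if a != b and counts[(a, b)] >= min_count}
        for a in labels
    }
    transition = {}
    for a, nbrs in adjacency.items():
        total = sum(nbrs.values())
        transition[a] = {b: w / total for b, w in nbrs.items()} if total else {}
    return adjacency, transition

# Toy label sets (hypothetical, not drawn from AffectHuman-43K).
samples = [
    ("happy", "excited"),
    ("happy", "excited"),
    ("happy", "neutral"),
    ("sad", "neutral"),
]
adjacency, transition = cooccurrence_graph(samples)
```

In this toy data `happy` co-occurs with `excited` twice and with `neutral` once, so its outgoing probabilities are 2/3 and 1/3, and no edge links `happy` to `sad`.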

Circularity Check

0 steps flagged

No significant circularity; derivation uses standard optimal transport and graph priors as methodological choices

The paper frames controllability as structured latent geometry and aligns identity/attribute factors to graph priors via graph-constrained optimal transport. This is presented as a modeling decision rather than a derivation whose outputs are forced by its own fitted parameters or self-citations. No equations, predictions, or uniqueness theorems are shown to reduce by construction to inputs (e.g., no fitted parameter renamed as a prediction, no self-citation carrying the central claim). The AffectHuman-43K benchmark and geometry-aware metrics are introduced for evaluation, not as part of a self-referential loop. The approach remains self-contained against external benchmarks and standard techniques, yielding no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that graph priors can be constructed to represent semantically meaningful attribute trajectories and that optimal transport under those constraints will preserve identity while improving controllability.

axioms (1)

Domain assumption: Graph priors accurately capture semantic relationships and consistent trajectories among attributes across modalities.
Invoked when the method aligns factors with graph priors to encourage graph-consistent trajectories.

pith-pipeline@v0.9.0 · 5689 in / 1193 out tokens · 53820 ms · 2026-05-20T18:34:41.497164+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/README.md (headline theorem) reality_from_one_distinction (8-tick period) echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

normalized 8-class emotion taxonomy and fine-grained valence–arousal variation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 12 internal anchors

[1]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

work page 2022
[2]

Loosecontrol: Lifting controlnet for generalized depth conditioning

Shariq Farooq Bhat, Niloy Mitra, and Peter Wonka. Loosecontrol: Lifting controlnet for generalized depth conditioning. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

work page 2024
[3]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18392–18402, 2023

work page 2023
[4]

Emobank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis

Sven Buechel and Udo Hahn. Emobank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. InProceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585, 2017

work page 2017
[5]

Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

work page 2008
[6]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023. URL https://arxiv.org/abs/ 2310.00426

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen, Xiao Wang, Soravit Changpinyo, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language- image model.arXiv preprint arXiv:2209.06794, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[8]

Sinkhorn distances: Lightspeed computation of optimal transport.Advances in Neural Information Processing Systems (NeurIPS), 26, 2013

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport.Advances in Neural Information Processing Systems (NeurIPS), 26, 2013

work page 2013
[9]

Emoticrafter: Text-to-emotional-image generation based on valence-arousal model,

Shengqi Dang, Yi He, Long Ling, Ziqing Qian, Nanxuan Zhao, and Nan Cao. Emoticrafter: Text-to-emotional-image generation based on valence-arousal model, 2025. URL https: //arxiv.org/abs/2501.05710

work page arXiv 2025
[10]

Diffusionrig: Learning personalized priors for facial appearance editing

Zeyu Ding, Xingang Zhang, Zhanjie Xia, Louis Jebe, Zhuowen Tu, and Xiangyu Zhang. Diffusionrig: Learning personalized priors for facial appearance editing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12736–12746, 2023

work page 2023
[11]

Emoportraits: Emotion-enhanced multimodal one-shot head avatars

Nikita Drobyshev, Adriana Bigata Casademunt, Konstantinos V ougioukas, Zoe Landgraf, Stavros Petridis, and Maja Pantic. Emoportraits: Emotion-enhanced multimodal one-shot head avatars. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8498–8507, 2024

work page 2024
[12]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/ 2403.03206

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Gheorghe Comanici et. al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https: //arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

OpenAI et. al. Gpt-4o system card, 2024. URLhttps://arxiv.org/abs/2410.21276. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Imagebind: One embedding space to bind them all

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kaushal V Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15180–15190, 2023

work page 2023
[16]

Demystifying flux architecture.arXiv preprint arXiv:2507.09595, 2025

Or Greenberg. Demystifying flux architecture.arXiv preprint arXiv:2507.09595, 2025

work page arXiv 2025
[17]

Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Image generation from scene graphs

Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1219– 1228, 2018

work page 2018
[19]

Progressive Growing of GANs for Improved Quality, Stability, and Variation

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation.arXiv preprint arXiv:1710.10196, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[20]

Progressive growing of gans for improved quality, stability, and variation, 2018

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation, 2018. URL https://arxiv.org/abs/1710. 10196

work page 2018
[21]

Taxaadapter: Vision taxonomy models are key to fine-grained image generation over the tree of life.arXiv preprint arXiv:2603.26128, 2026

Mridul Khurana, Amin Karimi Monsefi, Justin Lee, Medha Sawhney, David Carlyn, Julia Chae, Jianyang Gu, Rajiv Ramnath, Sara Beery, Wei-Lun Chao, et al. Taxaadapter: Vision taxonomy models are key to fine-grained image generation over the tree of life.arXiv preprint arXiv:2603.26128, 2026

work page arXiv 2026
[22]

Diffusionclip: Text-guided diffusion models for robust image manipulation.arXiv preprint arXiv:2110.02711, 2022

Gihyun Kim, Taehoon Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation.arXiv preprint arXiv:2110.02711, 2022

work page arXiv 2022
[23]

Neural relational inference for interacting systems

Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. InInternational conference on machine learning, pages 2688–2697. Pmlr, 2018

work page 2018
[24]

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25] Chenlin Li, Ziyang Chen, Peiye Sun, et al. Magicbrush: A manually annotated dataset for instruction-guided image editing. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023
[26] Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, and Chen Chen. Controlnet++: Improving conditional controls with efficient consistency feedback: Project page: liming-ai.github.io/controlnet_plus_plus. In European Conference on Computer Vision, pages 129–147. Springer, 2024
[27] Shan Li, Weihong Deng, and JunPing Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2852–2861, 2017
[28] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015
[29] Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english. PloS one, 13(5):e0196391, 2018
[30] Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. Ace++: Instruction-based image creation and editing via context-aware content filling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1958–1966, 2025
[31] Hrvoje Máretić, Sebastian Claici, Edward Chien, and Justin Solomon. Gromov-wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning (ICML), pages 4424–4433, 2019
[32] Abolfazl Meyarian, Amin Karimi Monsefi, Rajiv Ramnath, and Ser-Nam Lim. Direct: Disentangled regularization of contrastive trajectories for physics-refined video generation. arXiv preprint arXiv:2603.25931, 2026
[33] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6038–6047, 2023
[34] Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1):18–31, January 2019. ISSN 2371-9850. doi: 10.1109/taffc.2017.2740923. URL http://dx.doi.org/10.1109/TAFFC.2017.2740923
[35] Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, and Irina Belousova. Fs-dfm: Fast and accurate long text generation with few-step diffusion language models. arXiv preprint arXiv:2509.20624, 2025
[36] Amin Karimi Monsefi, Mridul Khurana, Rajiv Ramnath, Anuj Karpatne, Wei-Lun Chao, and Cheng Zhang. Taxadiffusion: Progressively trained diffusion model for fine-grained species generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8579–8589, 2025
[37] Pouyan Navard, Amin Karimi Monsefi, Mengxi Zhou, Wei-Lun Chao, Alper Yilmaz, and Rajiv Ramnath. Knobgen: Controlling the sophistication of artwork in sketch-based diffusion models. arXiv preprint arXiv:2410.01595, 2024
[38] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2085–2094, 2021
[39] Maitreya Patel, Song Wen, Dimitris N Metaxas, and Yezhou Yang. Flowchef: Steering of rectified flow models for controlled generations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15308–15318, 2025
[40] William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748
[41] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world, 2023. URL https://arxiv.org/abs/2306.14824
[42] Gabriel Peyré and Marco Cuturi. Computational Optimal Transport, volume 11. Foundations and Trends in Machine Learning, 2019
[43] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. URL https://arxiv.org/abs/2307.01952
[44] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamila Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021
[45] Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix X-F Ye, and Molei Tao. Diffuse everything: Multimodal diffusion models on arbitrary state spaces. arXiv preprint arXiv:2506.07903, 2025
[46] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022
[47] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine-tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22500–22510, 2023
[48] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Sharif Mahdavi, Raphael Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Imagen: Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Process...
[49] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024
[50] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1921–1930, 2023
[51] Titouan Vayer, Laetitia Chapel, Rémi Flamary, Romain Tavenard, and Nicolas Courty. Optimal transport for structured data with application on graphs. Proceedings of the 37th International Conference on Machine Learning, 2020
[52] Cédric Villani. Optimal transport: old and new. Grundlehren der Mathematischen Wissenschaften, 338, 2008
[53] Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746, 2024
[54] Baao Xie, Qiuyu Chen, Yunnan Wang, Zequn Zhang, Xin Jin, and Wenjun Zeng. Graph-based unsupervised disentangled representation learning via multimodal large language models, 2024. URL https://arxiv.org/abs/2407.18999
[55] Hongteng Xu, Dixin Luo, and Lawrence Carin. Gromov-wasserstein learning for graph matching and node embedding. In International Conference on Machine Learning (ICML), pages 6932–6941, 2019
[56] Jingyuan Yang, Jiashi Feng, and Hui Huang. Emogen: Emotional image content generation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6358–6368, 2024
[57] Jingyuan Yang, Jiashi Feng, Wei Luo, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Emoedit: Evoking emotions through image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24690–24699, 2025
[58] Qingtao Yu, Jaskirat Singh, Zhaoyuan Yang, Peter Henry Tu, Jing Zhang, Hongdong Li, Richard Hartley, and Dylan Campbell. Probability density geodesics in image diffusion latent space. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27989–27998, 2025
[59] Chao Zhang, Chen Zhang, Meng Zhang, and In So Kweon. Dreamtalk: Diffusion-based realistic emotional audio-driven method for single image talking face generation. arXiv preprint arXiv:2312.13578, 2023
[60] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023
[61] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023
[62] Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. Enabling instructional image editing with in-context generation in large scale diffusion transformer. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
[63] Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2024
[64] Dewei Zhou, Mingwei Li, Zongxin Yang, Yu Lu, Yunqiu Xu, Zhizhong Wang, Zeyi Huang, and Yi Yang. Bidedpo: Conditional image generation with simultaneous text and condition alignment. arXiv preprint arXiv:2511.19268, 2025

Appendix Contents
A Theoretical Analysis of Controlla
A.1 Structured Controllability as Metric Preservation ...


[20] [20]

Progressive growing of gans for improved quality, stability, and variation, 2018

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation, 2018. URL https://arxiv.org/abs/1710. 10196

work page 2018

[21] [21]

Taxaadapter: Vision taxonomy models are key to fine-grained image generation over the tree of life.arXiv preprint arXiv:2603.26128, 2026

Mridul Khurana, Amin Karimi Monsefi, Justin Lee, Medha Sawhney, David Carlyn, Julia Chae, Jianyang Gu, Rajiv Ramnath, Sara Beery, Wei-Lun Chao, et al. Taxaadapter: Vision taxonomy models are key to fine-grained image generation over the tree of life.arXiv preprint arXiv:2603.26128, 2026

work page arXiv 2026

[22] [22]

Diffusionclip: Text-guided diffusion models for robust image manipulation.arXiv preprint arXiv:2110.02711, 2022

Gihyun Kim, Taehoon Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation.arXiv preprint arXiv:2110.02711, 2022

work page arXiv 2022

[23] [23]

Neural relational inference for interacting systems

Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. InInternational conference on machine learning, pages 2688–2697. Pmlr, 2018

work page 2018

[24] [24]

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Magicbrush: A manually annotated dataset for instruction-guided image editing

Chenlin Li, Ziyang Chen, Peiye Sun, et al. Magicbrush: A manually annotated dataset for instruction-guided image editing. InAdvances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023

work page 2023

[26] [26]

Controlnet++: Improving conditional controls with efficient consistency feedback: Project page: liming-ai

Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, and Chen Chen. Controlnet++: Improving conditional controls with efficient consistency feedback: Project page: liming-ai. github. io/controlnet_plus_plus. InEuropean Conference on Computer Vision, pages 129–147. Springer, 2024

work page 2024

[27] [27]

Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild

Shan Li, Weihong Deng, and JunPing Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2852–2861, 2017

work page 2017

[28] [28]

Deep learning face attributes in the wild

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. InProceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015

work page 2015

[29] [29]

Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english.PloS one, 13(5):e0196391, 2018

work page 2018

[30] [30]

Ace++: Instruction-based image creation and editing via context-aware content filling

Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. Ace++: Instruction-based image creation and editing via context-aware content filling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1958–1966, 2025. 11

work page 1958

[31] [31]

Gromov-wasserstein averaging of kernel and distance matrices

Hrvoje Máreti´c, Sebastian Claici, Edward Chien, and Justin Solomon. Gromov-wasserstein averaging of kernel and distance matrices. InInternational Conference on Machine Learning (ICML), pages 4424–4433, 2019

work page 2019

[32] [32]

Direct: Disen- tangled regularization of contrastive trajectories for physics-refined video generation.arXiv preprint arXiv:2603.25931, 2026

Abolfazl Meyarian, Amin Karimi Monsefi, Rajiv Ramnath, and Ser-Nam Lim. Direct: Disen- tangled regularization of contrastive trajectories for physics-refined video generation.arXiv preprint arXiv:2603.25931, 2026

work page arXiv 2026

[33] [33]

Null-text inversion for editing real images using guided diffusion models

Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6038–6047, 2023

work page 2023

[34] [34]

Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild.IEEE Transactions on Affective Computing, 10(1):18–31, January 2019. ISSN 2371-9850. doi: 10.1109/taffc.2017.2740923. URLhttp://dx.doi.org/10.1109/TAFFC.2017.2740923

work page doi:10.1109/taffc.2017.2740923 2019

[35] [35]

FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, and Irina Belousova. Fs-dfm: Fast and accurate long text generation with few-step diffusion language models.arXiv preprint arXiv:2509.20624, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

Taxadiffusion: Progressively trained diffusion model for fine-grained species generation

Amin Karimi Monsefi, Mridul Khurana, Rajiv Ramnath, Anuj Karpatne, Wei-Lun Chao, and Cheng Zhang. Taxadiffusion: Progressively trained diffusion model for fine-grained species generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8579–8589, 2025

work page 2025

[37] [37]

Knobgen: Controlling the sophistication of artwork in sketch-based diffusion models

Pouyan Navard, Amin Karimi Monsefi, Mengxi Zhou, Wei-Lun Chao, Alper Yilmaz, and Rajiv Ramnath. Knobgen: Controlling the sophistication of artwork in sketch-based diffusion models. arXiv preprint arXiv:2410.01595, 2024

work page arXiv 2024

[38] [38]

Styleclip: Text-driven manipulation of stylegan imagery

Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. InProceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2085–2094, 2021

work page 2085

[39] [39]

Flowchef: Steering of rectified flow models for controlled generations

Maitreya Patel, Song Wen, Dimitris N Metaxas, and Yezhou Yang. Flowchef: Steering of rectified flow models for controlled generations. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15308–15318, 2025

work page 2025

[40] [40]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748

work page internal anchor Pith review Pith/arXiv arXiv 2023

[41] [41]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world, 2023. URL https://arxiv.org/abs/2306.14824

work page internal anchor Pith review Pith/arXiv arXiv 2023

[42] [42]

Foundations and Trends in Machine Learning, 2019

Gabriel Peyré and Marco Cuturi.Computational Optimal Transport, volume 11. Foundations and Trends in Machine Learning, 2019

work page 2019

[43] [43]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. URLhttps://arxiv.org/abs/2307.01952

work page internal anchor Pith review Pith/arXiv arXiv 2023

[44] [44]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamila Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021

work page 2021

[45] [45]

Diffuse everything: Multimodal diffusion models on arbitrary state spaces.arXiv preprint arXiv:2506.07903, 2025

Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix X-F Ye, and Molei Tao. Diffuse everything: Multimodal diffusion models on arbitrary state spaces.arXiv preprint arXiv:2506.07903, 2025

work page arXiv 2025

[46] [46]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 12

work page 2022

[47] [47]

Dreambooth: Fine-tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine-tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22500–22510, 2023

work page 2023

[48] [48]

Fleet, and Mohammad Norouzi

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Sharif Mahdavi, Raphael Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Imagen: Photorealistic text-to-image diffusion models with deep language understanding. InAdvances in Neural Information Process...

work page 2022

[49] [49]

Emu edit: Precise image editing via recognition and generation tasks

Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024

work page 2024

[50] [50]

Plug-and-play diffusion features for text-driven image-to-image translation

Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1921–1930, 2023

work page 1921

[51] [51]

Optimal transport for structured data with application on graphs.Proceedings of the 37th International Conference on Machine Learning, 2020

Titouan Vayer, Laetitia Chapel, Rémi Flamary, Romain Tavenard, and Nicolas Courty. Optimal transport for structured data with application on graphs.Proceedings of the 37th International Conference on Machine Learning, 2020

work page 2020

[52] [52]

Optimal transport: old and new.Grundlehren der Mathematischen Wis- senschaften, 338, 2008

Cédric Villani. Optimal transport: old and new.Grundlehren der Mathematischen Wis- senschaften, 338, 2008

work page 2008

[53] [53]

Tam- ing rectified flow for inversion and editing

Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing.arXiv preprint arXiv:2411.04746, 2024

work page arXiv 2024

[54] [54]

Graph-based unsupervised disentangled representation learning via multimodal large language models, 2024

Baao Xie, Qiuyu Chen, Yunnan Wang, Zequn Zhang, Xin Jin, and Wenjun Zeng. Graph-based unsupervised disentangled representation learning via multimodal large language models, 2024. URLhttps://arxiv.org/abs/2407.18999

work page arXiv 2024

[55] [55]

Gromov-Wasserstein learning for graph matching and node embedding

Hongteng Xu, Dixin Luo, and Lawrence Carin. Gromov-Wasserstein learning for graph matching and node embedding. In International Conference on Machine Learning (ICML), pages 6932–6941, 2019

work page 2019

[56] [56]

Emogen: Emotional image content generation with text-to-image diffusion models

Jingyuan Yang, Jiashi Feng, and Hui Huang. Emogen: Emotional image content generation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6358–6368, 2024

work page 2024

[57] [57]

Emoedit: Evoking emotions through image manipulation

Jingyuan Yang, Jiashi Feng, Wei Luo, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Emoedit: Evoking emotions through image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24690–24699, 2025

work page 2025

[58] [58]

Probability density geodesics in image diffusion latent space

Qingtao Yu, Jaskirat Singh, Zhaoyuan Yang, Peter Henry Tu, Jing Zhang, Hongdong Li, Richard Hartley, and Dylan Campbell. Probability density geodesics in image diffusion latent space. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27989–27998, 2025

work page 2025

[59] [59]

Dreamtalk: Diffusion-based realistic emotional audio-driven method for single image talking face generation. arXiv preprint arXiv:2312.13578, 2023

Chao Zhang, Chen Zhang, Meng Zhang, and In So Kweon. Dreamtalk: Diffusion-based realistic emotional audio-driven method for single image talking face generation. arXiv preprint arXiv:2312.13578, 2023

work page arXiv 2023

[60] [60]

Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36:31428–31449, 2023

work page 2023

[61] [61]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023

work page 2023

[62] [62]

Enabling instructional image editing with in-context generation in large scale diffusion transformer

Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. Enabling instructional image editing with in-context generation in large scale diffusion transformer. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025

[63] [63]

Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2024

Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2024

work page 2024

[64] [64]

Bidedpo: Conditional image generation with simultaneous text and condition alignment. arXiv preprint arXiv:2511.19268, 2025

Dewei Zhou, Mingwei Li, Zongxin Yang, Yu Lu, Yunqiu Xu, Zhizhong Wang, Zeyi Huang, and Yi Yang. Bidedpo: Conditional image generation with simultaneous text and condition alignment. arXiv preprint arXiv:2511.19268, 2025

work page arXiv 2025