pith. sign in

arxiv: 2605.16603 · v1 · pith:LHZARPLUnew · submitted 2026-05-15 · 💻 cs.CV

Controlla: Learning Controllability via Graph-Constrained Latent Geometry

Pith reviewed 2026-05-20 18:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords controllable multimodal generationlatent geometrygraph priorsoptimal transportidentity preservationaffective controltrajectory consistency
0
0 comments X

The pith

Controlla structures latent geometry with graph priors so attributes evolve along consistent paths while identity stays fixed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that controllability in multimodal generation can be achieved by learning identity and attribute factors and aligning them to graph priors through graph-constrained optimal transport. This approach encourages attributes to change in ways that respect predefined semantic relationships across different input types. A sympathetic reader would care because current conditioning methods often cause identity to drift or produce inconsistent behaviors when switching between modalities. By making the latent space follow graph-consistent trajectories, the method aims to produce more reliable control without needing extra guidance at inference time. The authors introduce a new benchmark called AffectHuman-43K to test reference-grounded affective control and metrics that check trajectory consistency and disentanglement.

Core claim

Controlla is a modular factorized-control framework that treats controllability as a property of structured latent geometry. It learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport. This encourages attributes to follow graph-consistent trajectories while preserving reference identity. Experiments on the AffectHuman-43K benchmark demonstrate improvements in controllability, identity preservation, and cross-modal alignment.

What carries the argument

Graph-constrained optimal transport, which aligns learned identity and attribute factors with graph priors to enforce consistent trajectories in latent space.

If this is right

  • Attributes follow graph-consistent trajectories across modalities.
  • Identity is preserved better during control operations.
  • Cross-modal alignment improves in generated outputs.
  • The framework shows robustness and extensibility according to the reported analyses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If similar graph priors can be defined for other attribute sets, the alignment step might transfer to tasks such as expression control in video or pose manipulation in 3D.
  • The explicit separation of identity and attribute factors could support modular systems where one factor is held constant while another is adjusted independently.
  • Geometry-aware evaluation metrics could be applied to test disentanglement in other generative models that operate on multimodal inputs.

Load-bearing premise

The graph priors correctly encode the semantic relationships and trajectories among attributes across modalities.

What would settle it

An observation that attribute trajectories in the latent space fail to match the graph structure or that identity preservation metrics show no gain over baseline conditioning methods on the AffectHuman-43K benchmark would disprove the claim.

Figures

Figures reproduced from arXiv: 2605.16603 by Amin Karimi Monsefi, Jamuna S. Murthy, Rajiv Ramnath.

Figure 1
Figure 1. Figure 1: Controllability as structured latent geometry. Controlla maps image, reference image, text, and audio into factorized identity and attribute spaces. Attribute factors follow graph-consistent semantic traversal, while the reference-grounded identity factor is held stable [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Controlla framework. Multimodal inputs (image, reference image, text, audio) are encoded into a shared representation and factorized into attribute and identity components. Emotion and identity graph priors are aligned with these factors using graph-constrained optimal transport, enabling graph-consistent attribute traversal while preserving reference identity. factorization encoder, graph-OT latent alignm… view at source ↗
Figure 3
Figure 3. Figure 3: Effect of graph strength. Increasing λg improves controllability and human preference while reducing GC, with CLIP reported only as an auxiliary alignment diagnostic. Effect of Components and Graph Structure [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Graph-consistent vs. linear traversal. Graph-consistent traversal yields smoother, identity-preserving transitions and lower GC than linear interpolation. 4.7 Qualitative Results [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Cross-dataset qualitative comparison. Generated examples show identity preservation, expression consistency, and semantic alignment across datasets and methods under matched inputs [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Graph-guided latent control. Structured latent geometry yields smooth emotion transitions with consistent identity, unlike linear interpolation. The formulation can support alternative graph-defined control factors beyond affective control, such as pose, expression intensity, or style; see Appendix Sec. L and Sec. M. 5 Conclusion and Future Work We presented Controlla, which models controllability as a pro… view at source ↗
Figure 7
Figure 7. Figure 7: Emotion distribution across 8 classes. AffectHuman-43K provides broad affective coverage with mild natural skew across emotion categories. Why singleton groups matter. Singleton groups prevent the benchmark from being dominated by repeated identities. They also test whether a model can preserve identity from a single reference image, which is a common real-world use case for reference-guided editing. C.4 E… view at source ↗
Figure 8
Figure 8. Figure 8: Preference boxplots. Scores compare Controlla with baselines across evaluation criteria. 34 [PITH_FULL_IMAGE:figures/full_fig_p034_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Mean preference heatmap. Average scores summarize Controlla’s gains across evaluation dimensions. E.5 Results and Analysis Across evaluation dimensions, Controlla receives consistently positive preference scores. The boxplots in [PITH_FULL_IMAGE:figures/full_fig_p035_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Preference-score distributions. Violin plots show response variability across criteria. F Statistical Significance Analysis [PITH_FULL_IMAGE:figures/full_fig_p035_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Ranked mean preferences. Criteria are ordered by average preference score [PITH_FULL_IMAGE:figures/full_fig_p036_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Demographic preference distributions. Positive scores remain stable across participant groups. F.1 Controlla Architecture Controlla is implemented as a modular factorized-control stack around a pretrained diffusion gener￾ator. The base Stable Diffusion v1.5 generator is kept frozen unless otherwise specified, while the multimodal adapters, factorization heads, graph-OT alignment module, and generator-cond… view at source ↗
Figure 13
Figure 13. Figure 13: Cross-dataset comparison across AffectHuman-43K, CelebA-HQ, and AffectNet [PITH_FULL_IMAGE:figures/full_fig_p048_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Comparison across representative baseline methods. Multi-identity evaluation [PITH_FULL_IMAGE:figures/full_fig_p048_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Multi-identity evaluation under shared target conditions. Multimodal ambiguity. Figures 19 and 20 examine scenarios with potentially conflicting or ambiguous multimodal inputs. These cases test how models integrate multiple signals when cues are not fully aligned. Across methods, outputs may vary in how different signals are prioritized. Controlla produces more consistent outputs across the presented exam… view at source ↗
Figure 16
Figure 16. Figure 16: Fine-grained variations within high-level emotion categories. This behavior reflects how graph constraints shape the latent space and influence the trade-off between controllability and diversity. J.3 Behavior Under Multimodal Ambiguity. Controlla is designed for settings in which image, text, and audio provide complementary affective cues, but in practice these signals may be partially ambiguous or weakl… view at source ↗
Figure 17
Figure 17. Figure 17: Challenging cases involving visually similar expressions [PITH_FULL_IMAGE:figures/full_fig_p051_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Challenging cases with subtle semantic differences [PITH_FULL_IMAGE:figures/full_fig_p051_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Multimodal ambiguity with partially conflicting cues [PITH_FULL_IMAGE:figures/full_fig_p052_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Multimodal ambiguity under diverse identity and expression settings. This behavior suggests that Controlla integrates multimodal signals through the learned attribute factor rather than relying on a single dominant cue. However, the model does not include an explicit conflict-resolution module. Therefore, when modalities provide strongly contradictory affective instructions, the output may reflect a compr… view at source ↗
Figure 21
Figure 21. Figure 21: Behavior under conflicting multimodal signals. Inputs include a smiling image, calm audio, and strongly negative text. The resulting outputs reflect a combination of these cues, illustrating how structured priors guide latent transitions when signals are not fully aligned. transport structure can be cached or avoided, since generation only requires the learned factorization and conditioning adapters. Seve… view at source ↗
Figure 22
Figure 22. Figure 22: Effect of graph regularization strength. Varying λ changes the balance between semantic consistency and expressive variability. Lower values produce more diverse outputs, while higher values yield more consistent but less varied expressions [PITH_FULL_IMAGE:figures/full_fig_p054_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Multimodal ambiguity. Partially conflicting image, text, and audio cues produce outputs that combine or prioritize affective evidence across modalities. K Extensibility Beyond Affective Control K.1 Framework Generality Controlla is formulated as a graph-conditioned latent control framework, rather than as an emotion￾specific architecture. The method requires two abstract ingredients: (i) an attribute grap… view at source ↗
Figure 24
Figure 24. Figure 24: Pose control. A plug-in pose graph enables ordered pose traversal while preserving reference identity. Lighting control via plug-in graphs [PITH_FULL_IMAGE:figures/full_fig_p057_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Lighting control. A plug-in lighting graph enables structured illumination changes while preserving identity and content. 57 [PITH_FULL_IMAGE:figures/full_fig_p057_25.png] view at source ↗
read the original abstract

Controllable multimodal generation is commonly formulated as an inference-time conditioning problem using prompts, guidance, or auxiliary modules. While effective, such approaches do not explicitly structure how semantic attributes evolve, which can lead to identity drift and inconsistent cross-modal behavior. We propose Controlla, a modular factorized-control framework that treats controllability as a property of structured latent geometry. Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport, encouraging attributes to follow graph-consistent trajectories while preserving reference identity. To evaluate this setting, we construct AffectHuman-43K, a leakage-aware multimodal benchmark for reference-grounded affective control, and introduce geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments show consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses on graph sensitivity, extensibility, and robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Controlla, a modular factorized-control framework for controllable multimodal generation. It learns identity and attribute factors from multimodal inputs and aligns them with graph priors via graph-constrained optimal transport, with the goal of encouraging attributes to follow graph-consistent trajectories while preserving reference identity. The authors introduce the leakage-aware AffectHuman-43K benchmark for reference-grounded affective control along with geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments are reported to show consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses on graph sensitivity and robustness.

Significance. If the central empirical claims hold and the graph priors are demonstrated to capture verifiable semantic structure rather than benchmark-specific artifacts, the work could meaningfully advance controllable generation by treating controllability as an intrinsic property of structured latent geometry instead of relying solely on inference-time conditioning. The introduction of a new multimodal affective benchmark and geometry-aware evaluation metrics constitutes a concrete contribution to the field.

major comments (2)
  1. [Section 3.2] Section 3.2 (Graph priors and OT alignment): The construction and validation of the graph priors for AffectHuman-43K is described at a high level only. It remains unclear whether the graphs are hand-crafted, derived from data statistics, or externally validated against cross-modal semantic trajectories (e.g., affective state transitions). This detail is load-bearing for the central claim that the method produces graph-consistent trajectories without identity drift; absent such validation, reported gains risk being attributable to fitting the specific benchmark rather than the latent-geometry mechanism.
  2. [Section 5] Section 5 (Experiments and ablations): The manuscript reports consistent improvements but provides insufficient quantitative detail on effect sizes, standard deviations, or ablations isolating the graph-constrained OT term versus the factorized representation alone. Without these, it is difficult to assess whether the geometry-aware metrics genuinely support the controllability claims or whether gains could arise from other modeling choices.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'consistent improvements' would be strengthened by including one or two key numerical results (e.g., percentage gains on controllability or identity metrics) to give readers an immediate sense of effect magnitude.
  2. [Section 3] Notation: The distinction between the identity factor and attribute factor embeddings could be clarified with an explicit equation or diagram early in Section 3 to avoid ambiguity when discussing the OT alignment step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. The comments identify important areas where additional detail will strengthen the manuscript's support for the central claims regarding graph-constrained latent geometry. We address each major comment below and commit to the indicated revisions.

read point-by-point responses
  1. Referee: [Section 3.2] Section 3.2 (Graph priors and OT alignment): The construction and validation of the graph priors for AffectHuman-43K is described at a high level only. It remains unclear whether the graphs are hand-crafted, derived from data statistics, or externally validated against cross-modal semantic trajectories (e.g., affective state transitions). This detail is load-bearing for the central claim that the method produces graph-consistent trajectories without identity drift; absent such validation, reported gains risk being attributable to fitting the specific benchmark rather than the latent-geometry mechanism.

    Authors: We agree that the current description in Section 3.2 is high-level and requires expansion to substantiate the core claim. The graph priors are constructed in a data-driven manner by deriving transition probabilities from co-occurrence statistics of affective attribute labels across the multimodal samples in AffectHuman-43K. We will revise Section 3.2 to include the precise construction procedure (including the adjacency matrix computation and edge weighting), along with any steps taken to align the resulting graph with established affective transition patterns. This revision will clarify the distinction between benchmark-specific fitting and the intended latent-geometry mechanism. revision: yes

  2. Referee: [Section 5] Section 5 (Experiments and ablations): The manuscript reports consistent improvements but provides insufficient quantitative detail on effect sizes, standard deviations, or ablations isolating the graph-constrained OT term versus the factorized representation alone. Without these, it is difficult to assess whether the geometry-aware metrics genuinely support the controllability claims or whether gains could arise from other modeling choices.

    Authors: We acknowledge that the experimental reporting in Section 5 would benefit from greater quantitative rigor. While the manuscript already contains ablation variants that remove the OT alignment component, we agree that reporting effect sizes, standard deviations over multiple runs, and a more isolated comparison of the graph-constrained OT term versus the factorized representation alone would better isolate the contribution of the geometry mechanism. We will expand Section 5 with these details, including tables that report means and standard deviations for the key geometry-aware metrics and a dedicated ablation isolating the OT term. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses standard optimal transport and graph priors as methodological choices

full rationale

The paper frames controllability as structured latent geometry and aligns identity/attribute factors to graph priors via graph-constrained optimal transport. This is presented as a modeling decision rather than a derivation whose outputs are forced by its own fitted parameters or self-citations. No equations, predictions, or uniqueness theorems are shown to reduce by construction to inputs (e.g., no fitted parameter renamed as prediction, no self-citation load-bearing the central claim). The AffectHuman-43K benchmark and geometry-aware metrics are introduced for evaluation, not as part of a self-referential loop. The approach remains self-contained against external benchmarks and standard techniques, yielding no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that graph priors can be constructed to represent semantically meaningful attribute trajectories and that optimal transport under those constraints will preserve identity while improving controllability.

axioms (1)
  • domain assumption Graph priors accurately capture semantic relationships and consistent trajectories among attributes across modalities
    Invoked when the method aligns factors with graph priors to encourage graph-consistent trajectories.

pith-pipeline@v0.9.0 · 5689 in / 1193 out tokens · 53820 ms · 2026-05-20T18:34:41.497164+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 12 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

  2. [2]

    Loosecontrol: Lifting controlnet for generalized depth conditioning

    Shariq Farooq Bhat, Niloy Mitra, and Peter Wonka. Loosecontrol: Lifting controlnet for generalized depth conditioning. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

  3. [3]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18392–18402, 2023

  4. [4]

    Emobank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis

    Sven Buechel and Udo Hahn. Emobank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. InProceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585, 2017

  5. [5]

    Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database.Language resources and evaluation, 42(4):335–359, 2008

  6. [6]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023. URL https://arxiv.org/abs/ 2310.00426

  7. [7]

    PaLI: A Jointly-Scaled Multilingual Language-Image Model

    Xi Chen, Xiao Wang, Soravit Changpinyo, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language- image model.arXiv preprint arXiv:2209.06794, 2022

  8. [8]

    Sinkhorn distances: Lightspeed computation of optimal transport.Advances in Neural Information Processing Systems (NeurIPS), 26, 2013

    Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport.Advances in Neural Information Processing Systems (NeurIPS), 26, 2013

  9. [9]

    Emoticrafter: Text-to-emotional-image generation based on valence-arousal model,

    Shengqi Dang, Yi He, Long Ling, Ziqing Qian, Nanxuan Zhao, and Nan Cao. Emoticrafter: Text-to-emotional-image generation based on valence-arousal model, 2025. URL https: //arxiv.org/abs/2501.05710

  10. [10]

    Diffusionrig: Learning personalized priors for facial appearance editing

    Zeyu Ding, Xingang Zhang, Zhanjie Xia, Louis Jebe, Zhuowen Tu, and Xiangyu Zhang. Diffusionrig: Learning personalized priors for facial appearance editing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12736–12746, 2023

  11. [11]

    Emoportraits: Emotion-enhanced multimodal one-shot head avatars

    Nikita Drobyshev, Adriana Bigata Casademunt, Konstantinos V ougioukas, Zoe Landgraf, Stavros Petridis, and Maja Pantic. Emoportraits: Emotion-enhanced multimodal one-shot head avatars. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8498–8507, 2024

  12. [12]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/ 2403.03206

  13. [13]

    Gheorghe Comanici et. al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https: //arxiv.org/abs/2507.06261

  14. [14]

    OpenAI et. al. Gpt-4o system card, 2024. URLhttps://arxiv.org/abs/2410.21276. 10

  15. [15]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kaushal V Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15180–15190, 2023

  16. [16]

    Demystifying flux architecture.arXiv preprint arXiv:2507.09595, 2025

    Or Greenberg. Demystifying flux architecture.arXiv preprint arXiv:2507.09595, 2025

  17. [17]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022

  18. [18]

    Image generation from scene graphs

    Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1219– 1228, 2018

  19. [19]

    Progressive Growing of GANs for Improved Quality, Stability, and Variation

    Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation.arXiv preprint arXiv:1710.10196, 2017

  20. [20]

    Progressive growing of gans for improved quality, stability, and variation, 2018

    Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation, 2018. URL https://arxiv.org/abs/1710. 10196

  21. [21]

    Taxaadapter: Vision taxonomy models are key to fine-grained image generation over the tree of life.arXiv preprint arXiv:2603.26128, 2026

    Mridul Khurana, Amin Karimi Monsefi, Justin Lee, Medha Sawhney, David Carlyn, Julia Chae, Jianyang Gu, Rajiv Ramnath, Sara Beery, Wei-Lun Chao, et al. Taxaadapter: Vision taxonomy models are key to fine-grained image generation over the tree of life.arXiv preprint arXiv:2603.26128, 2026

  22. [22]

    Diffusionclip: Text-guided diffusion models for robust image manipulation.arXiv preprint arXiv:2110.02711, 2022

    Gihyun Kim, Taehoon Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation.arXiv preprint arXiv:2110.02711, 2022

  23. [23]

    Neural relational inference for interacting systems

    Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. InInternational conference on machine learning, pages 2688–2697. Pmlr, 2018

  24. [24]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025

  25. [25]

    Magicbrush: A manually annotated dataset for instruction-guided image editing

    Chenlin Li, Ziyang Chen, Peiye Sun, et al. Magicbrush: A manually annotated dataset for instruction-guided image editing. InAdvances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023

  26. [26]

    Controlnet++: Improving conditional controls with efficient consistency feedback: Project page: liming-ai

    Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, and Chen Chen. Controlnet++: Improving conditional controls with efficient consistency feedback: Project page: liming-ai. github. io/controlnet_plus_plus. InEuropean Conference on Computer Vision, pages 129–147. Springer, 2024

  27. [27]

    Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild

    Shan Li, Weihong Deng, and JunPing Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2852–2861, 2017

  28. [28]

    Deep learning face attributes in the wild

    Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. InProceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015

  29. [29]

    Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english.PloS one, 13(5):e0196391, 2018

  30. [30]

    Ace++: Instruction-based image creation and editing via context-aware content filling

    Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. Ace++: Instruction-based image creation and editing via context-aware content filling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1958–1966, 2025. 11

  31. [31]

    Gromov-wasserstein averaging of kernel and distance matrices

    Hrvoje Máreti´c, Sebastian Claici, Edward Chien, and Justin Solomon. Gromov-wasserstein averaging of kernel and distance matrices. InInternational Conference on Machine Learning (ICML), pages 4424–4433, 2019

  32. [32]

    Direct: Disen- tangled regularization of contrastive trajectories for physics-refined video generation.arXiv preprint arXiv:2603.25931, 2026

    Abolfazl Meyarian, Amin Karimi Monsefi, Rajiv Ramnath, and Ser-Nam Lim. Direct: Disen- tangled regularization of contrastive trajectories for physics-refined video generation.arXiv preprint arXiv:2603.25931, 2026

  33. [33]

    Null-text inversion for editing real images using guided diffusion models

    Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6038–6047, 2023

  34. [34]

    Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild.IEEE Transactions on Affective Computing, 10(1):18–31, January 2019. ISSN 2371-9850. doi: 10.1109/taffc.2017.2740923. URLhttp://dx.doi.org/10.1109/TAFFC.2017.2740923

  35. [35]

    FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

    Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, and Irina Belousova. Fs-dfm: Fast and accurate long text generation with few-step diffusion language models.arXiv preprint arXiv:2509.20624, 2025

  36. [36]

    Taxadiffusion: Progressively trained diffusion model for fine-grained species generation

    Amin Karimi Monsefi, Mridul Khurana, Rajiv Ramnath, Anuj Karpatne, Wei-Lun Chao, and Cheng Zhang. Taxadiffusion: Progressively trained diffusion model for fine-grained species generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8579–8589, 2025

  37. [37]

    Knobgen: Controlling the sophistication of artwork in sketch-based diffusion models

    Pouyan Navard, Amin Karimi Monsefi, Mengxi Zhou, Wei-Lun Chao, Alper Yilmaz, and Rajiv Ramnath. Knobgen: Controlling the sophistication of artwork in sketch-based diffusion models. arXiv preprint arXiv:2410.01595, 2024

  38. [38]

    Styleclip: Text-driven manipulation of stylegan imagery

    Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. InProceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2085–2094, 2021

  39. [39]

    Flowchef: Steering of rectified flow models for controlled generations

    Maitreya Patel, Song Wen, Dimitris N Metaxas, and Yezhou Yang. Flowchef: Steering of rectified flow models for controlled generations. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15308–15318, 2025

  40. [40]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748

  41. [41]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world, 2023. URL https://arxiv.org/abs/2306.14824

  42. [42]

    Foundations and Trends in Machine Learning, 2019

    Gabriel Peyré and Marco Cuturi.Computational Optimal Transport, volume 11. Foundations and Trends in Machine Learning, 2019

  43. [43]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. URLhttps://arxiv.org/abs/2307.01952

  44. [44]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamila Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021

  45. [45]

    Diffuse everything: Multimodal diffusion models on arbitrary state spaces.arXiv preprint arXiv:2506.07903, 2025

    Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix X-F Ye, and Molei Tao. Diffuse everything: Multimodal diffusion models on arbitrary state spaces.arXiv preprint arXiv:2506.07903, 2025

  46. [46]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 12

  47. [47]

    Dreambooth: Fine-tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine-tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22500–22510, 2023

  48. [48]

    Fleet, and Mohammad Norouzi

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Sharif Mahdavi, Raphael Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Imagen: Photorealistic text-to-image diffusion models with deep language understanding. InAdvances in Neural Information Process...

  49. [49]

    Emu edit: Precise image editing via recognition and generation tasks

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024

  50. [50]

    Plug-and-play diffusion features for text-driven image-to-image translation

    Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1921–1930, 2023

  51. [51]

    Optimal transport for structured data with application on graphs.Proceedings of the 37th International Conference on Machine Learning, 2020

    Titouan Vayer, Laetitia Chapel, Rémi Flamary, Romain Tavenard, and Nicolas Courty. Optimal transport for structured data with application on graphs.Proceedings of the 37th International Conference on Machine Learning, 2020

  52. [52]

    Optimal transport: old and new.Grundlehren der Mathematischen Wis- senschaften, 338, 2008

    Cédric Villani. Optimal transport: old and new.Grundlehren der Mathematischen Wis- senschaften, 338, 2008

  53. [53]

    Tam- ing rectified flow for inversion and editing

    Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing.arXiv preprint arXiv:2411.04746, 2024

  54. [54]

    Graph-based unsupervised disentangled representation learning via multimodal large language models, 2024

    Baao Xie, Qiuyu Chen, Yunnan Wang, Zequn Zhang, Xin Jin, and Wenjun Zeng. Graph-based unsupervised disentangled representation learning via multimodal large language models, 2024. URLhttps://arxiv.org/abs/2407.18999

  55. [55]

    Gromov-wasserstein learning for graph matching and node embedding

    Hongteng Xu, Dixin Luo, and Lawrence Carin. Gromov-wasserstein learning for graph matching and node embedding. InInternational Conference on Machine Learning (ICML), pages 6932– 6941, 2019

  56. [56]

    Emogen: Emotional image content generation with text-to-image diffusion models

    Jingyuan Yang, Jiashi Feng, and Hui Huang. Emogen: Emotional image content generation with text-to-image diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6358–6368, 2024

  57. [57]

    Emoedit: Evoking emotions through image manipulation

    Jingyuan Yang, Jiashi Feng, Wei Luo, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Emoedit: Evoking emotions through image manipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24690–24699, 2025

  58. [58]

    Probability density geodesics in image diffusion latent space

    Qingtao Yu, Jaskirat Singh, Zhaoyuan Yang, Peter Henry Tu, Jing Zhang, Hongdong Li, Richard Hartley, and Dylan Campbell. Probability density geodesics in image diffusion latent space. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27989–27998, 2025

  59. [59]

    Dreamtalk: Diffusion-based realistic emotional audio-driven method for single image talking face generation.arXiv preprint arXiv:2312.13578, 2023

    Chao Zhang, Chen Zhang, Meng Zhang, and In So Kweon. Dreamtalk: Diffusion-based realistic emotional audio-driven method for single image talking face generation.arXiv preprint arXiv:2312.13578, 2023

  60. [60]

    Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36:31428–31449, 2023

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36:31428–31449, 2023

  61. [61]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023. 13

  62. [62]

    Enabling instructional image editing with in-context generation in large scale diffusion transformer

    Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. Enabling instructional image editing with in-context generation in large scale diffusion transformer. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  63. [63]

    Ultraedit: Instruction-based fine-grained image editing at scale.Advances in Neural Information Processing Systems, 37:3058–3093, 2024

    Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale.Advances in Neural Information Processing Systems, 37:3058–3093, 2024

  64. [64]

    Bidedpo: Conditional image generation with simultaneous text and condition alignment.arXiv preprint arXiv:2511.19268, 2025

    Dewei Zhou, Mingwei Li, Zongxin Yang, Yu Lu, Yunqiu Xu, Zhizhong Wang, Zeyi Huang, and Yi Yang. Bidedpo: Conditional image generation with simultaneous text and condition alignment.arXiv preprint arXiv:2511.19268, 2025. 14 Appendix Appendix Contents A Theoretical Analysis of Controlla 17 A.1 Structured Controllability as Metric Preservation . . . . . . . ...