
arxiv: 2509.24798 · v6 · pith:BCOTYJIRnew · submitted 2025-09-29 · 💻 cs.CV · cs.AI

Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Pith reviewed 2026-05-21 21:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords counterfactual generation · text-to-image diffusion · causal modeling · image editing · structural causal models · diffusion models · attribute control · identity preservation

The pith

Causal-Adapter adapts frozen text-to-image diffusion models to generate counterfactual images that respect known causal relationships between attributes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Causal-Adapter as a way to add causal structure to existing diffusion models without retraining the core network. It combines a structural causal model with two regularization steps so that changing one attribute reliably updates its causal dependents while leaving unrelated parts of the image untouched. A reader might care because this produces more consistent edits than prompt-only methods, which often create unrealistic or inconsistent changes when applied to tasks such as medical imaging or object simulation.

Core claim

Causal-Adapter adapts frozen text-to-image diffusion backbones for counterfactual image generation. It leverages structural causal modeling together with prompt-aligned injection of causal attributes into textual embeddings and a conditioned token contrastive loss that disentangles factors and reduces spurious correlations. The result supports targeted interventions on chosen attributes while consistently propagating effects to causal dependents and preserving the core identity of the original image.

What carries the argument

Causal-Adapter, which injects a known structural causal model into a frozen diffusion backbone via prompt-aligned injection and a conditioned token contrastive loss to enforce faithful attribute interventions and identity preservation.
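The counterfactual reasoning that a structural causal model supports can be sketched with the standard abduction-action-prediction recipe. This is a minimal illustration on a toy linear SCM, not the paper's implementation: the variable names, the coefficients `W_AGE` and `W_GENDER`, and the function names are all assumptions made for the example.

```python
# Toy linear SCM with hypothetical coefficients (assumed for illustration):
#   beard = W_AGE * age + W_GENDER * gender + u_beard
W_AGE = 0.5      # assumed effect of age on beard
W_GENDER = 1.0   # assumed effect of gender on beard


def beard_mechanism(age, gender, u_beard):
    """Structural equation for the dependent attribute."""
    return W_AGE * age + W_GENDER * gender + u_beard


def counterfactual_beard(observed, do_age):
    """Abduction-action-prediction for the intervention do(age = do_age)."""
    # Abduction: recover the exogenous noise consistent with the observation.
    u_beard = (observed["beard"]
               - W_AGE * observed["age"]
               - W_GENDER * observed["gender"])
    # Action: overwrite age; gender is not a descendant of age, so it is kept.
    # Prediction: re-evaluate the mechanism with the recovered noise.
    return {
        "age": do_age,
        "gender": observed["gender"],
        "beard": beard_mechanism(do_age, observed["gender"], u_beard),
    }


observed = {"age": 2.0, "gender": 1.0, "beard": 2.5}
cf = counterfactual_beard(observed, do_age=4.0)
print(cf)  # {'age': 4.0, 'gender': 1.0, 'beard': 3.5}
```

The recovered noise term is what carries the individual's identity across the intervention: the dependent attribute moves with its parent, while everything not downstream of the intervention stays fixed.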

If this is right

  • Targeted changes to one attribute produce consistent updates to all its causal descendants in the generated image.
  • The same frozen diffusion backbone can be reused across different causal graphs by swapping only the adapter components.
  • High-fidelity medical images such as MRIs can be edited while keeping patient identity intact.
  • Quantitative gains appear on both synthetic benchmarks and real-world datasets without retraining the underlying model.
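The first consequence above, that an intervention updates exactly the causal descendants of the target, can be made concrete with a graph traversal. The sketch below assumes a hypothetical four-attribute graph loosely based on the face attributes named in Figure 1; the paper's actual edges may differ, and `CAUSAL_GRAPH`, `descendants`, and `split_attributes` are illustrative names.

```python
from collections import deque

# Hypothetical causal graph (parent -> children); edges are an assumption.
CAUSAL_GRAPH = {
    "age": ["beard", "bald"],
    "gender": ["beard", "bald"],
    "beard": [],
    "bald": [],
}


def descendants(graph, node):
    """Return every attribute reachable from `node` along directed edges."""
    seen = set()
    queue = deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(graph.get(current, []))
    return seen


def split_attributes(graph, target):
    """Partition attributes into those that may change and those that must stay."""
    may_change = descendants(graph, target) | {target}
    must_stay = set(graph) - may_change
    return may_change, must_stay


may_change, must_stay = split_attributes(CAUSAL_GRAPH, "age")
print(sorted(may_change))  # ['age', 'bald', 'beard']
print(sorted(must_stay))   # ['gender']
```

Under this assumed graph, intervening on a leaf such as `beard` would leave every other attribute in the "must stay" set, which matches the localized-edit behaviour a non-causal editor produces for all attributes.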

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adapter pattern could be tested on text-to-video or text-to-3D models if their causal structures are supplied.
  • Future work might explore learning the causal graph directly from data instead of requiring it as input.
  • If the contrastive loss term is removed, attribute disentanglement would likely degrade on datasets with many interdependent factors.

Load-bearing premise

The method assumes that an accurate structural causal model of the target attributes is already known and can be specified correctly in advance.

What would settle it

Running the method on a dataset where the supplied causal graph is deliberately incorrect and observing that counterfactual accuracy falls to or below the level of ordinary prompt engineering would falsify the central claim.
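One way to set up such a test is to generate deliberately misspecified graphs from the correct one and rerun the evaluation on each. The sketch below only produces the corrupted graphs by deleting one edge at a time; the graph itself and the name `drop_one_edge_variants` are assumptions for illustration, and the evaluation step is left out because it depends on the paper's pipeline.

```python
def drop_one_edge_variants(graph):
    """Yield (removed_edge, corrupted_graph) for every single-edge deletion."""
    for parent, children in graph.items():
        for child in children:
            corrupted = {
                p: [c for c in cs if (p, c) != (parent, child)]
                for p, cs in graph.items()
            }
            yield (parent, child), corrupted


# Hypothetical causal graph (parent -> children); edges are an assumption.
graph = {
    "age": ["beard", "bald"],
    "gender": ["beard", "bald"],
    "beard": [],
    "bald": [],
}

variants = list(drop_one_edge_variants(graph))
print(len(variants))  # 4
for removed, corrupted in variants:
    print(removed, corrupted)
```

Edge reversals or spurious added edges would be natural further corruptions; single deletions are used here only because they are the simplest case to enumerate.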

Figures

Figures reproduced from arXiv: 2509.24798 by Chaochao Lu, Chen Jin, Dino Oglic, Lei Tong, Philip Teare, Sotirios A. Tsaftaris, Tom Diethe, Zhihua Liu.

Figure 1: Non-causal editing modifies only the target attribute (e.g. age, gender); causal editing propagates changes to related attributes (e.g. beard, baldness) enforced by the causal graph. Answering counterfactual questions (e.g. inferring what an event would have happened under an alternative action) requires understanding the cause–effect relationships among variables and performing hypothetical reasoning (P…
Figure 2: A sketch comparison of counterfactual image generation methods based on: (a) …
Figure 3: Motivational study and preliminary counterfactual generation results between T2I methods and Causal-Adapter. (a) Fine-grained anatomical counterfactual editing of brain ventricular volume using inversion-based editing (NTI (Mokady et al., 2023)), multi-concept prompt-learning editing (MCPL (Jin et al., 2024)), and our approach. (b) Comparison of counterfactual editing results on human faces. (c) Averaged c…
Figure 4: Method overview. A counterfactual prompt and input image …
Figure 5: Pendulum counterfactuals with traversal edit …
Figure 6: CelebA counterfactuals from Causal-Adapter compared with prior methods. Human Face Counterfactuals. Following the benchmarking of Melistas et al. (2024), we evaluate Causal-Adapter on CelebA test set for human face counterfactual generation across four categorical attributes (age, gender, beard, bald) with the causal graph shown in …
Figure 7: ADNI brain MRI counterfactual results from Causal-Adapter. Direct causal effects are …
Figure 8: Ablation study on CelebA validation set. (a) Average intervention effectiveness. (b) Realism and minimality. (c) Qualitative examples, with dotted boxes indicating results of localized editing. 4 CONCLUSION We introduced Causal-Adapter to tame Text-to-Image diffusion models for counterfactual image generation. Our motivational study revealed that current Text-to-Image diffusion model based editing appro…
Figure 9: Null-Textual Inversion (NTI) relies heavily on prompt engineering, where minor word …
Figure 10: Multi-Concept Prompt Learning (MCPL) as a representative prompt-learning baseline.
Figure 11: Fine-grained anatomical counterfactual editing of brain ventricular volume. NTI and …
Figure 12: Impact of guidance scale on FID and CLD across three Causal-Adapter variants. Note that …
Figure 13: Impact of DDIM steps on FID and CLD
Figure 14: Counterfactuals from Causal-Adapter variants under different guidance scales. The plain …
Figure 15: Full ablation visualizations with optional attention guidance (AG). Causal-Adapter with …
Figure 16: Average cross-attention maps from Causal-Adapter variants. Tokens denote attributes: …
Figure 17: Pendulum counterfactuals from Causal-Adapter.
Figure 18: Pendulum counterfactuals from Causal-Adapter.
Figure 19: Additional counterfactual results on the CelebA dataset (with edit samples selected in a non …
Figure 20: Additional counterfactual results on the CelebA dataset (with edit samples selected in a …
Figure 21: Additional counterfactual results from random interventions on each attribute in the …
Figure 22: Additional counterfactual results from random interventions on each attribute in the ADNI …
Figure 23: Average cross-attention maps from Causal-Adapter on CelebA dataset. Token denote …
Figure 24: Average cross-attention maps from Causal-Adapter on ADNI dataset. Token denote …
Figure 25: Average cross-attention maps from Causal-Adapter on Pendulum dataset. Token denote …
Figure 26: Counterfactuals generated by Causal-Adapter on CelebA under beard interventions.
original abstract

We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method supports causal interventions on target attributes and consistently propagates their effects to causal dependents while preserving the core identity of the image. Unlike prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling with two attribute-regularization strategies: (i) prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and (ii) a conditioned token contrastive loss that disentangles attribute factors and reduces spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, including up to a 91% reduction in MAE on Pendulum for accurate attribute control and up to an 87% reduction in FID on ADNI for high-fidelity MRI generation. These results demonstrate robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation. Code and models will be released at: https://leitong02.github.io/causaladapter/.
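The headline figures in the abstract are relative reductions of error metrics where lower is better (MAE, FID). As a reading aid, the arithmetic is sketched below; the input values are hypothetical and chosen only so the result lands on 91%. They are not numbers from the paper, and `relative_reduction` is an illustrative name.

```python
def relative_reduction(baseline, ours):
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - ours) / baseline


# Hypothetical values for illustration only; not taken from the paper.
baseline_mae, adapter_mae = 2.0, 0.18
print(f"{relative_reduction(baseline_mae, adapter_mae):.0%}")  # 91%
```

Note that "up to" in the abstract refers to the best case across settings, so a single relative reduction says nothing about the average gain or its variance across runs.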

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. It uses structural causal models to support interventions on target attributes, combined with prompt-aligned injection and a conditioned token contrastive loss to propagate effects to causal dependents while preserving image identity. The paper reports state-of-the-art results, including up to 91% MAE reduction on the Pendulum dataset and 87% FID reduction on the ADNI dataset.

Significance. If the empirical claims hold after addressing the noted concerns, the work would advance faithful counterfactual generation in diffusion models by explicitly incorporating causal structure, offering a practical alternative to prompt engineering. The modular adapter design with a frozen backbone is a clear strength for efficient deployment, and the focus on both synthetic and real-world medical imaging datasets highlights potential applicability in causal inference tasks.

major comments (2)
  1. [Abstract] The SOTA performance claims (up to 91% MAE reduction on Pendulum and up to 87% FID reduction on ADNI) are central to the contribution, yet they rest on the untested assumption that a correctly specified SCM, combined with prompt-aligned injection and the conditioned token contrastive loss, will propagate interventions faithfully through the frozen backbone without introducing new spurious correlations. The manuscript provides no sensitivity analysis or validation of the SCM specification for complex attributes in ADNI.
  2. [Methods] Regularization strategies: The conditioned token contrastive loss is described as disentangling attribute factors, but it is unclear how this token-level mechanism guarantees image-level causal consistency for downstream dependents. A failure here would mean the reported metrics reflect improved editing rather than true counterfactual faithfulness, directly affecting the central claim.
minor comments (1)
  1. [Abstract] The abstract would benefit from a short statement on the number of runs or statistical significance for the reported percentage reductions to strengthen the quantitative claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and indicate where revisions will be made to improve clarity and rigor.

point-by-point responses
  1. Referee: [Abstract] The SOTA performance claims (up to 91% MAE reduction on Pendulum and up to 87% FID reduction on ADNI) are central to the contribution, yet they rest on the untested assumption that a correctly specified SCM, combined with prompt-aligned injection and the conditioned token contrastive loss, will propagate interventions faithfully through the frozen backbone without introducing new spurious correlations. The manuscript provides no sensitivity analysis or validation of the SCM specification for complex attributes in ADNI.

    Authors: We acknowledge that the manuscript does not present a dedicated sensitivity analysis for SCM specification on complex ADNI attributes. The SCM is derived from established domain knowledge (e.g., age influencing ventricular volume and cortical thickness). Empirical validation is provided through quantitative metrics (MAE/FID reductions) and qualitative checks showing faithful propagation without obvious spurious artifacts. In revision we will add a dedicated paragraph discussing SCM construction, its assumptions, and limitations for complex attributes. revision: partial

  2. Referee: [Methods] Regularization strategies: The conditioned token contrastive loss is described as disentangling attribute factors, but it is unclear how this token-level mechanism guarantees image-level causal consistency for downstream dependents. A failure here would mean the reported metrics reflect improved editing rather than true counterfactual faithfulness, directly affecting the central claim.

    Authors: The contrastive loss is applied to prompt tokens that serve as the conditioning signal for the entire diffusion process. By pulling apart embeddings of causally related versus unrelated attribute tokens, it reduces spurious correlations in the latent space that the frozen backbone then uses to synthesize the full image. Because generation is holistic, token-level disentanglement translates to image-level consistency for dependents. We will expand the methods section with a clearer step-by-step explanation of this propagation and reference the ablation results that isolate the loss contribution. revision: partial
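The general idea of a contrastive objective on token embeddings can be sketched with a generic InfoNCE loss (van den Oord et al., 2018, reference 34). This is not the paper's conditioned token contrastive loss, whose exact form is not given on this page; the 3-dimensional embeddings, the temperature value, and the function names are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE: small when the anchor is closer to the positive
    than to any of the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Hypothetical 3-d token embeddings, for illustration only.
age_token = [1.0, 0.0, 0.0]
same_attribute = [0.9, 0.1, 0.0]
other_attributes = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Positive is the matching attribute token: loss is close to zero.
well_separated = info_nce(age_token, same_attribute, other_attributes)
# Positive is an unrelated token and the matching one is a negative: loss is large.
entangled = info_nce(age_token, [0.0, 1.0, 0.0], [same_attribute, [0.0, 0.0, 1.0]])
print(well_separated < entangled)  # True
```

A loss of this kind only constrains the conditioning tokens. Whether separation at the token level carries through to the generated image is exactly the referee's open question, and the sketch does not settle it.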

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external validation

full rationale

The paper introduces a modular adapter that injects causal interventions into a frozen diffusion backbone via prompt-aligned injection and a conditioned token contrastive loss, assuming a pre-specified SCM. Performance metrics (MAE reduction on Pendulum, FID on ADNI) are reported as experimental outcomes on held-out data rather than quantities algebraically forced by the method's own equations or by self-citation. No derivation step equates a prediction to a fitted input by construction, and the central claims remain falsifiable through independent benchmarks outside the fitted regularization weights.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence of a usable structural causal model for the image attributes and on the assumption that the diffusion backbone can be steered by the proposed injection and contrastive mechanisms without retraining.

axioms (1)
  • domain assumption: Image attributes obey a known or specifiable causal graph that can be used to guide interventions.
    The method explicitly leverages structural causal modeling to propagate attribute changes.

pith-pipeline@v0.9.0 · 5738 in / 1230 out tokens · 42939 ms · 2026-05-21T21:51:31.496132+00:00 · methodology



Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 3 internal anchors

  1. [1]

    Fixing a broken elbo

    Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken elbo. In International conference on machine learning, pp. 159–168. PMLR, 2018

  2. [2]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 18392–18402, 2023

  3. [3]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020

  4. [4]

    High fidelity image counterfactuals with probabilistic causal models

    Fabio De Sousa Ribeiro, Tian Xia, Miguel Monteiro, Nick Pawlowski, and Ben Glocker. High fidelity image counterfactuals with probabilistic causal models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 7390–7425, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/d...

  5. [5]

    Prompt tuning inversion for text-driven image editing using diffusion models

    Wenkai Dong, Song Xue, Xiaoyue Duan, and Shumin Han. Prompt tuning inversion for text-driven image editing using diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7430–7440, 2023

  6. [6]

    An image is worth one word: Personalizing text-to-image generation using textual inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NAQvF08TcyG

  7. [7]

    Generative adversarial nets

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014

  8. [8]

    Prompt-to-prompt image editing with cross-attention control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross-attention control. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=_CDixzkzeyb

  9. [9]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  10. [10]

    beta-vae: Learning basic visual concepts with a constrained variational framework

    Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International conference on learning representations, 2017

  11. [11]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL https://openreview.net/forum?id=qw8AKxfYbI

  12. [12]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  13. [13]

    Composer: Creative and controllable image synthesis with composable conditions

    Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proce...

  14. [14]

    Diffusion model-based image editing: A survey

    Yi Huang, Jiancheng Huang, Yifan Liu, Mingfu Yan, Jiaxi Lv, Jianzhuang Liu, Wei Xiong, He Zhang, Liangliang Cao, and Shifeng Chen. Diffusion model-based image editing: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  15. [15]

    An edit friendly ddpm noise space: Inversion and manipulations

    Inbar Huberman-Spiegelglas, Vladimir Kulikov, and Tomer Michaeli. An edit friendly ddpm noise space: Inversion and manipulations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12469–12478, 2024

  16. [16]

    An image is worth multiple words: Discovering object level concepts using multi-concept prompt learning

    Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, and Philip Alexander Teare. An image is worth multiple words: Discovering object level concepts using multi-concept prompt learning. In Forty-first International Conference on Machine Learning, 2024

  17. [17]

    Pnp inversion: Boosting diffusion-based editing with 3 lines of code

    Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Pnp inversion: Boosting diffusion-based editing with 3 lines of code. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=FoMZ4ljhVw

  18. [18]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  19. [19]

    Causal GAN: Learning causal implicit generative models with adversarial training

    Murat Kocaoglu, Christopher Snyder, Alexandros G. Dimakis, and Sriram Vishwanath. Causal GAN: Learning causal implicit generative models with adversarial training. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJE-4xW0W

  20. [20]

    From identifiable causal representations to controllable counterfactual generation: A survey on causal generative modeling

    Aneesh Komanduri, Xintao Wu, Yongkai Wu, and Feng Chen. From identifiable causal representations to controllable counterfactual generation: A survey on causal generative modeling. Transactions on Machine Learning Research, 2024a. ISSN 2835-8856. URL https://openreview.net/forum?id=PUpZXvNqmb

  21. [21]

    Learning causally disentangled representations via the principle of independent causal mechanisms

    Aneesh Komanduri, Yongkai Wu, Feng Chen, and Xintao Wu. Learning causally disentangled representations via the principle of independent causal mechanisms. In Kate Larson (ed.), Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, pp. 4308–4316. International Joint Conferences on Artificial Intelligence Or...

  22. [22]

    Causal diffusion autoencoders: Toward counterfactual generation via diffusion probabilistic models

    Aneesh Komanduri, Chen Zhao, Feng Chen, and Xintao Wu. Causal diffusion autoencoders: Toward counterfactual generation via diffusion probabilistic models. European Conference on Artificial Intelligence, 2024c

  23. [23]

    Applying guidance in a limited interval improves sample and distribution quality in diffusion models

    Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. Advances in Neural Information Processing Systems, 37:122458–122483, 2024

  24. [24]

    Dispose: Disentangling pose guidance for controllable human image animation

    Hongxiang Li, Yaowei Li, Yuhang Yang, Junjie Cao, Zhihong Zhu, Xuxin Cheng, and Long Chen. Dispose: Disentangling pose guidance for controllable human image animation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=AumOa10MKG

  25. [25]

    Causal representation learning via counterfactual intervention

    Xiutian Li, Siqi Sun, and Rui Feng. Causal representation learning via counterfactual intervention. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pp. 3234–3242, 2024

  26. [26]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22511–22521, 2023

  27. [27]

    Segment anyword: Mask prompt inversion for open-set grounded segmentation

    Zhihua Liu, Amrutha Saseendran, Lei Tong, Xilin He, Fariba Yousefi, Nikolay Burlutskiy, Dino Oglic, Tom Diethe, Philip Alexander Teare, Huiyu Zhou, and Chen Jin. Segment anyword: Mask prompt inversion for open-set grounded segmentation. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=9bzgpYtQZn

  28. [28]

    Deep learning face attributes in the wild

    Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015

  29. [29]

    Benchmarking counterfactual image generation

    Thomas Melistas, Nikos Spyrou, Nefeli Gkouti, Pedro Sanchez, Athanasios Vlontzos, Yannis Panagakis, Giorgos Papanastasiou, and Sotirios Tsaftaris. Benchmarking counterfactual image generation. Advances in Neural Information Processing Systems, 37:133207–133230, 2024

  30. [30]

    Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models

    Daiki Miyake, Akihiro Iohara, Yu Saito, and Toshiyuki Tanaka. Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2063–2072. IEEE, 2025

  31. [31]

    Null-text inversion for editing real images using guided diffusion models

    Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6038–6047, 2023

  32. [32]

    Measuring axiomatic soundness of counterfactual image models

    Miguel Monteiro, Fabio De Sousa Ribeiro, Nick Pawlowski, Daniel C. Castro, and Ben Glocker. Measuring axiomatic soundness of counterfactual image models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=lZOUQQvwI3q

  33. [33]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

    Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pp. 4296–4304, 2024

  34. [34]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  35. [35]

    Counterfactual image editing

    Yushu Pan and Elias Bareinboim. Counterfactual image editing. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=OXzkw7vFIO

  36. [36]

    Normalizing flows for probabilistic modeling and inference

    George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021

  37. [37]

    Deep structural causal models for tractable counterfactual inference

    Nick Pawlowski, Daniel Coelho de Castro, and Ben Glocker. Deep structural causal models for tractable counterfactual inference. Advances in neural information processing systems, 33:857–869, 2020

  38. [38]

    Causality

    Judea Pearl. Causality. Cambridge university press, 2009

  39. [39]

    Causal inference

    Judea Pearl. Causal inference. Causality: objectives and assessment, pp. 39–58, 2010

  40. [40]

    Structural counterfactuals: A brief introduction

    Judea Pearl. Structural counterfactuals: A brief introduction. Cognitive science, 37(6):977–985, 2013

  41. [41]

    Alzheimer's disease neuroimaging initiative (adni) clinical characterization

    Ronald Carl Petersen, Paul S Aisen, Laurel A Beckett, Michael C Donohue, Anthony Collins Gamst, Danielle J Harvey, CR Jack Jr, William J Jagust, Leslie M Shaw, Arthur W Toga, et al. Alzheimer's disease neuroimaging initiative (adni) clinical characterization. Neurology, 74(3):201–209, 2010

  42. [42]

    Diffusion autoencoders: Toward a meaningful and decodable representation

    Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10619–10629, 2022

  43. [43]

    Enhancing spatiotemporal disease progression models via latent diffusion and prior knowledge

    Lemuel Puglisi, Daniel C Alexander, and Daniele Ravì. Enhancing spatiotemporal disease progression models via latent diffusion and prior knowledge. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 173–183. Springer, 2024

  44. [44]

    Diffusion counterfactual generation with semantic abduction

    Rajat R Rasal, Avinash Kori, Fabio De Sousa Ribeiro, Tian Xia, and Ben Glocker. Diffusion counterfactual generation with semantic abduction. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=Wqrqcc8O2v

  45. [45]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684--10695, 2022

  46. [46]

    Diffusion causal models for counterfactual estimation

    Pedro Sanchez and Sotirios A. Tsaftaris. Diffusion causal models for counterfactual estimation. In First Conference on Causal Learning and Reasoning, 2022. URL https://openreview.net/forum?id=LAAZLZIMN-o

  47. [47]

    Toward causal representation learning

    Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612--634, 2021

  48. [48]

    Weakly supervised disentangled generative causal representation learning

    Xinwei Shen, Furui Liu, Hanze Dong, Qing Lian, Zhitang Chen, and Tong Zhang. Weakly supervised disentangled generative causal representation learning. Journal of Machine Learning Research, 23(241):1--55, 2022

  49. [49]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256--2265. PMLR, 2015

  50. [50]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. International Conference on Learning Representations, 2021

  51. [51]

    Causally steered diffusion for automated video counterfactual generation

    Nikos Spyrou, Athanasios Vlontzos, Paraskevas Pegios, Thomas Melistas, Nefeli Gkouti, Yannis Panagakis, Giorgos Papanastasiou, and Sotirios A Tsaftaris. Causally steered diffusion for automated video counterfactual generation. arXiv preprint arXiv:2506.14404, 2025

  52. [52]

    Diff-def: Diffusion-generated deformation fields for conditional atlases

    Sophie Starck, Vasiliki Sideri-Lampretsa, Bernhard Kainz, Martin J Menten, Tamara T Mueller, and Daniel Rueckert. Diff-def: Diffusion-generated deformation fields for conditional atlases. IEEE Transactions on Medical Imaging, 2025

  53. [53]

    Going deeper with convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1--9, 2015

  54. [54]

    Nvae: A deep hierarchical variational autoencoder

    Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder. Advances in neural information processing systems, 33:19667--19679, 2020

  55. [55]

    Concept decomposition for visual exploration and inspiration

    Yael Vinker, Andrey Voynov, Daniel Cohen-Or, and Ariel Shamir. Concept decomposition for visual exploration and inspiration. ACM Transactions on Graphics (TOG), 42(6):1--13, 2023

  56. [56]

    Causality from bottom to top: a survey

    Abraham Itzhak Weinberg, Cristiano Premebida, and Diego Resende Faria. Causality from bottom to top: a survey. arXiv preprint arXiv:2403.11219, 2024

  57. [57]

    Learning likelihoods with conditional normalizing flows

    Christina Winkler, Daniel Worrall, Emiel Hoogeboom, and Max Welling. Learning likelihoods with conditional normalizing flows, 2020. URL https://openreview.net/forum?id=rJg3zxBYwH

  58. [58]

    Counterfactual generative modeling with variational causal inference

    Yulun Wu, Louie McConnell, and Claudia Iriondo. Counterfactual generative modeling with variational causal inference. International Conference on Learning Representations, 2025

  59. [59]

    Factored Classifier-Free Guidance

    Tian Xia, Fabio De Sousa Ribeiro, Rajat R Rasal, Avinash Kori, Raghav Mehta, and Ben Glocker. Decoupled classifier-free guidance for counterfactual diffusion models. arXiv preprint arXiv:2506.14399, 2025

  60. [60]

    Inversion-free image editing with language-guided diffusion models

    Sihan Xu, Yidong Huang, Jiayi Pan, Ziqiao Ma, and Joyce Chai. Inversion-free image editing with language-guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9452--9461, 2024

  61. [61]

    Causalvae: Disentangled representation learning via neural structural causal models

    Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang. Causalvae: Disentangled representation learning via neural structural causal models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9593--9602, 2021

  62. [62]

    Diffusion model with cross attention as an inductive bias for disentanglement

    Tao Yang, Cuiling Lan, Yan Lu, and Nanning Zheng. Diffusion model with cross attention as an inductive bias for disentanglement. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  63. [63]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3836--3847, 2023

  64. [64]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586--595, 2018

  65. [65]

    Uni-controlnet: All-in-one control to text-to-image diffusion models

    Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K Wong. Uni-controlnet: All-in-one control to text-to-image diffusion models. Advances in Neural Information Processing Systems, 36:11127--11150, 2023
