Leveraging Foundation Models for Causal Generative Modeling

Aneesh Komanduri; Xintao Wu

arxiv: 2605.23861 · v1 · pith:FEGH4X6Vnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI · cs.CV

Pith reviewed 2026-05-25 04:36 UTC · model grok-4.3

keywords: causal generative modeling · foundation models · counterfactual generation · zero-shot causal discovery · diffusion models · concept manipulation

The pith

Pretrained foundation models enable zero-shot causal discovery and faithful counterfactual image generation via a modular pipeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FM-CGM as a way to chain a large reasoning model for inferring causes with a text-to-image diffusion model for producing altered images. The process extracts concepts from an image, uses the reasoning model to decide on interventions according to a causal graph, and then applies a guidance technique during generation so that only the relevant parts of the image change. This setup is presented as allowing the entire causal workflow to run without any task-specific training or fine-tuning. A sympathetic reader would care because it suggests existing large models already contain enough causal knowledge to support transparent reasoning about visual data.

Core claim

FM-CGM formalizes the causal pipeline through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. By leveraging a large reasoning model for causal inference and a text-to-image diffusion model for generation, the approach enables zero-shot causal discovery, intervention, and counterfactual generation. Causal Semantic Guidance, a cross-attention-based mechanism, ensures semantic interventions propagate to descendant concepts while preserving invariant regions.
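The three-stage composition described above can be sketched as plain function composition. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_pipeline`, `Counterfactual`, and the three `toy_*` stubs are hypothetical names, the weather concepts are invented for the example, and each stub stands in for a foundation-model call (vision-language model, reasoning model, diffusion model) whose interface the source does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Concepts = Dict[str, str]        # concept name -> categorical value
Edges = List[Tuple[str, str]]    # (cause, effect) pairs of a causal graph


@dataclass
class Counterfactual:
    concepts: Concepts   # concept values after the intervention
    image: str           # placeholder for the generated image


def run_pipeline(
    image: str,
    extract: Callable[[str], Tuple[Concepts, Edges]],
    manipulate: Callable[[Concepts, Edges, Concepts], Concepts],
    generate: Callable[[str, Concepts, Concepts], str],
    intervention: Concepts,
) -> Counterfactual:
    """Chain extractor -> manipulator -> generator (hypothetical interfaces)."""
    concepts, edges = extract(image)
    cf_concepts = manipulate(concepts, edges, intervention)
    cf_image = generate(image, concepts, cf_concepts)
    return Counterfactual(cf_concepts, cf_image)


# Toy stand-ins; a real system would call foundation models here.
def toy_extract(image: str) -> Tuple[Concepts, Edges]:
    return {"weather": "sunny", "ground": "dry"}, [("weather", "ground")]


def toy_manipulate(concepts: Concepts, edges: Edges, intervention: Concepts) -> Concepts:
    updated = dict(concepts)
    updated.update(intervention)
    # Hard-coded mechanism standing in for the reasoning model's inference.
    if updated["weather"] == "rainy":
        updated["ground"] = "wet"
    return updated


def toy_generate(image: str, old: Concepts, new: Concepts) -> str:
    changed = sorted(name for name in new if new[name] != old[name])
    return f"{image} edited on: {', '.join(changed)}"


result = run_pipeline(
    "scene.png", toy_extract, toy_manipulate, toy_generate, {"weather": "rainy"}
)
print(result.concepts)
print(result.image)
```

Passing the stages in as callables keeps the sketch modular in the same spirit as the framework: any stage can be swapped without touching the other two.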

What carries the argument

FM-CGM, the modular three-component pipeline (concept extractor, manipulator via large reasoning model, counterfactual generator via diffusion model) together with the Causal Semantic Guidance cross-attention mechanism.

If this is right

Zero-shot identification of plausible causal structures becomes possible on visual data.
Semantic interventions can be applied so that changes affect descendant concepts while leaving invariant regions unchanged.
The same pretrained models support end-to-end causal discovery, intervention, and counterfactual image production.
No task-specific training or domain adaptation is required for the pipeline to operate.
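The second consequence, that an intervention should reach descendant concepts and leave everything else alone, reduces to graph reachability once a causal graph is available. The sketch below assumes the graph is a DAG given as an edge list; the concept names are hypothetical, and it does not model how Causal Semantic Guidance enforces the split inside the diffusion model.

```python
from collections import deque


def descendants(edges, node):
    """Return every concept reachable from `node` along directed edges."""
    children = {}
    for cause, effect in edges:
        children.setdefault(cause, []).append(effect)
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def split_concepts(edges, all_concepts, intervened):
    """Partition concepts into those an intervention may change and the invariants."""
    affected = descendants(edges, intervened)
    invariant = set(all_concepts) - affected - {intervened}
    return affected, invariant


# Hypothetical causal graph over scene concepts.
edges = [("weather", "ground"), ("weather", "sky"), ("ground", "puddles")]
concepts = ["weather", "ground", "sky", "puddles", "building"]

affected, invariant = split_concepts(edges, concepts, "weather")
print(sorted(affected))   # concepts that should change with the weather
print(sorted(invariant))  # concepts that should be preserved
```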

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be tested on video or 3D data if analogous reasoning and generation models exist for those modalities.
It may reduce reliance on curated causal training sets for visual tasks.
Applications in explainable AI could follow if the extracted causal graphs prove stable across different foundation models.

Load-bearing premise

Pretrained foundation models already contain reliable causal reasoning that can be used directly for discovery and faithful image generation without fine-tuning or accuracy checks.

What would settle it

A dataset with known ground-truth causal graphs and interventions where the reasoning model outputs incorrect edges or the generated images violate the intended causal changes.
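One way such a test could be scored is to compare the reasoning model's edge list with the ground-truth graph. The sketch below is an illustration only: the function name `edge_metrics`, the node labels, and both edge lists are hypothetical, and no such evaluation is reported in the material reviewed here.

```python
def edge_metrics(predicted, truth):
    """Directed-edge precision and recall, plus a count of reversed edges."""
    pred, true = set(predicted), set(truth)
    true_positives = len(pred & true)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(true) if true else 0.0
    # An edge counts as reversed when only its flipped direction is in the truth.
    reversed_count = sum(
        1 for (a, b) in pred if (b, a) in true and (a, b) not in true
    )
    return {"precision": precision, "recall": recall, "reversed": reversed_count}


# Hypothetical ground truth and model output.
truth = [("A", "B"), ("B", "C"), ("A", "C")]
predicted = [("A", "B"), ("C", "B"), ("A", "D")]

metrics = edge_metrics(predicted, truth)
print(metrics)
```

Counting reversed edges separately is a design choice: a model that finds the right pairs but the wrong directions fails differently from one that invents unrelated links.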

Figures

Figures reproduced from arXiv: 2605.23861 by Aneesh Komanduri, Xintao Wu.

**Figure 1.** An overview of Foundation Model Powered Causal Generative Model (FM-CGM) consisting of a concept extractor, concept manipulator, and counterfactual generator enabled by foundation models. Adjacent paper text: "… for the causal concepts follows a Markov factorization: $p(C_1, \ldots, C_n) = \prod_{i=1}^{n} p(C_i \mid C_{\mathrm{pa}_i})$ (6). Concept Manipulator. Given an inferred set of concepts and their causal relationships, a concept manipulator foundation model …"
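The Markov factorization quoted in the Figure 1 text says the joint distribution over concepts is the product of each concept's conditional distribution given its parents. A small numeric sketch makes this concrete; the two-concept graph and all probabilities below are invented for illustration and do not come from the paper.

```python
from itertools import product

# Hypothetical graph: weather -> ground.
parents = {"weather": [], "ground": ["weather"]}
values = {"weather": ["sunny", "rainy"], "ground": ["dry", "wet"]}

# Conditional probability tables keyed by the tuple of parent values.
cpt = {
    "weather": {(): {"sunny": 0.7, "rainy": 0.3}},
    "ground": {
        ("sunny",): {"dry": 0.9, "wet": 0.1},
        ("rainy",): {"dry": 0.2, "wet": 0.8},
    },
}


def joint(assignment):
    """p(C_1, ..., C_n) as the product of p(C_i | parents of C_i)."""
    prob = 1.0
    for var, pa in parents.items():
        key = tuple(assignment[p] for p in pa)
        prob *= cpt[var][key][assignment[var]]
    return prob


names = list(values)
total = sum(
    joint(dict(zip(names, combo)))
    for combo in product(*(values[n] for n in names))
)
p_rainy_wet = joint({"weather": "rainy", "ground": "wet"})
print(p_rainy_wet)  # 0.3 * 0.8
print(total)        # sums to 1 up to floating-point error
```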

**Figure 2.** Concept extractor prompt. Adjacent paper text (Section 5.1, VLM-based Concept Extractor): "In this section, we provide the details of the Concept Extractor. We represent the concept extractor as a pretrained reasoning vision-language model (VLM) that maps an image to a textual description of a concept. A concept variable is a categorical variable that takes exactly one value at a time and describes the visual scene. The resulting (text) co…"

**Figure 3.** Concept manipulator prompt. Adjacent paper text: "… utilize the fast second-order DPM-Solver++ inversion [11] to obtain the latent representation $z_T$ via forward diffusion. Action. Given the counterfactual states $C'$ obtained from the VLM Concept Manipulator, we can now utilize classifier-free guidance and disentangled image editing techniques to realize counterfactual image edits. Now, for a given intervened concept $C_i$, we cons…"
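For readers unfamiliar with the classifier-free guidance mentioned in this excerpt, its standard form combines a conditional and an unconditional noise prediction as shown below. This is the generic formulation, not the paper's Causal Semantic Guidance, whose exact expression is cut off in the excerpt. The symbols $\epsilon_\theta$ (noise predictor), $c$ (text condition), $\varnothing$ (empty condition), and $w$ (guidance scale) are notation assumed here; only $z_t$ follows the excerpt.

```latex
\tilde{\epsilon}_\theta(z_t, c)
  = \epsilon_\theta(z_t, \varnothing)
  + w \,\bigl( \epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing) \bigr)
```

Under this convention, $w = 1$ recovers the plain conditional prediction and $w > 1$ pushes the sample further toward the condition $c$.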

**Figure 4.** Image counterfactual generation on human facial characteristics for (a) close-up profile and (b) garden scene.

**Figure 5.** Counterfactual generation on weather scenes.

**Figure 6.** VLM effectiveness evaluation prompt. Adjacent paper text: "… causes dry hair and fluffy hair) whereas baselines either add high variations or fail to incorporate causal changes. Quantitative Evaluation. We also provide a quantitative evaluation on a subset of the CelebA-HQ and MS-COCO datasets. Given a counterfactual image, we evaluate effectiveness of the intervention by using VLM-Eff, a VLM-based approach to identify whether t…"

Original abstract

Causal generative modeling is essential for developing reliable and transparent AI systems capable of counterfactual reasoning. While existing approaches focus on integrating causal constraints during the training of generative models, they often lack a unified framework to leverage the zero-shot reasoning capabilities of pretrained foundation models. We introduce FM-CGM, a modular framework for end-to-end visual causal reasoning using pretrained foundation models. FM-CGM formalizes the causal pipeline through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. By leveraging a large reasoning model for causal inference and a text-to-image diffusion model for generation, our approach enables zero-shot causal discovery, intervention, and counterfactual generation. We then develop Causal Semantic Guidance (CSG), a cross-attention-based mechanism that ensures semantic interventions propagate to descendant concepts while preserving invariant regions. We empirically show that our approach can identify plausible causal structures and is suitable for faithful counterfactual image generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper assembles a pipeline from existing foundation models for zero-shot causal discovery and counterfactual image generation, but offers no benchmarks against known causal graphs.

The main thing to know is that this work names a modular setup called FM-CGM that runs a concept extractor, feeds concepts to a reasoning model for causal inference, then uses a diffusion model with a new cross-attention trick (CSG) to apply interventions and produce counterfactual images. The CSG part tries to make sure changes to one concept affect only the right descendants while leaving invariants alone. That composition is the concrete addition over prior separate uses of these models. It is presented clearly as three stages and shows how off-the-shelf components can be chained for visual data without retraining. The authors also give the mechanism a name and describe its attention-based propagation, which is a small but explicit engineering step.

The evaluation section, however, only states that structures look plausible and generations look suitable. No ground-truth DAG datasets, no precision-recall against interventional data, and no comparison to existing causal discovery methods appear in the description. The zero-shot claim therefore sits on the untested premise that the reasoning model recovers correct edges rather than its own priors. This is the central soft spot and it is not minor.

The paper is aimed at people working on trustworthy generative models who want a practical recipe for adding causal steps to image pipelines. A reader already thinking about repurposing LLMs and diffusion models might pick up the CSG idea or the overall staging. It is not for anyone needing verified causal accuracy or new theoretical results. I would send it to peer review because the pipeline idea is worth testing properly; the authors should be asked to add standard causal benchmarks and quantitative checks on the discovery and faithfulness steps before stronger claims are made.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes FM-CGM, a modular framework for end-to-end visual causal reasoning that leverages pretrained foundation models without task-specific fine-tuning. It decomposes the pipeline into a concept extractor, a concept manipulator (using a large reasoning model for zero-shot causal inference), and a counterfactual generator (using a text-to-image diffusion model). The authors introduce Causal Semantic Guidance (CSG), a cross-attention mechanism to propagate semantic interventions to descendant concepts while preserving invariants. The central claim is that this enables zero-shot causal discovery, intervention, and faithful counterfactual generation, with empirical demonstrations that the method identifies 'plausible' causal structures and produces 'suitable' counterfactual images.

Significance. If the zero-shot claims were substantiated with ground-truth benchmarks, the work would offer a potentially significant shift in causal generative modeling by removing the need for causal-constraint integration during training. It would demonstrate that off-the-shelf foundation models can be composed into a causal pipeline, which could accelerate development of transparent counterfactual systems. The modular design and CSG mechanism are conceptually clean, but the absence of any reported quantitative validation against known DAGs or interventional data currently prevents assessment of whether the outputs reflect true causal structure or model priors.

major comments (2)

[Abstract] The statement that the approach 'empirically show[s]' identification of plausible causal structures and suitability for faithful counterfactual generation is load-bearing for the central claim, yet the manuscript provides no datasets (synthetic or real with known ground-truth DAGs), metrics (e.g., edge precision/recall, intervention accuracy), baselines, or error analysis. Without these, it is impossible to determine whether the zero-shot causal discovery recovers true edges or merely coherent outputs aligned with the reasoning model's priors.
[Abstract and framework description] The zero-shot causal inference step assumes the large reasoning model can extract a correct causal graph from extracted concepts without any validation or fine-tuning; this assumption is central to the FM-CGM pipeline but is not accompanied by any derivation, prompt details, or consistency checks that would allow reproduction or falsification of the claimed causal structures.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback, which identifies key areas where the manuscript's empirical claims and reproducibility details require clarification. We respond to each major comment below and will make targeted revisions to the abstract, framework description, and supplementary material.

Referee: [Abstract] The statement that the approach 'empirically show[s]' identification of plausible causal structures and suitability for faithful counterfactual generation is load-bearing for the central claim, yet the manuscript provides no datasets (synthetic or real with known ground-truth DAGs), metrics (e.g., edge precision/recall, intervention accuracy), baselines, or error analysis. Without these, it is impossible to determine whether the zero-shot causal discovery recovers true edges or merely coherent outputs aligned with the reasoning model's priors.

Authors: We agree that the abstract's language overstates the strength of the empirical support. The current manuscript relies on qualitative case studies demonstrating plausible structures and counterfactual images rather than quantitative benchmarks against ground-truth DAGs or interventional data. We will revise the abstract to replace 'empirically show' with 'illustrate through examples' and add a dedicated limitations subsection discussing the challenges of quantitative evaluation in zero-shot visual causal settings. This addresses the concern without altering the modular framework's contribution.
revision: yes

Referee: [Abstract and framework description] The zero-shot causal inference step assumes the large reasoning model can extract a correct causal graph from extracted concepts without any validation or fine-tuning; this assumption is central to the FM-CGM pipeline but is not accompanied by any derivation, prompt details, or consistency checks that would allow reproduction or falsification of the claimed causal structures.

Authors: We will expand the framework description and add an appendix with the exact prompts provided to the large reasoning model, along with examples of extracted graphs and any multi-run consistency observations from the experiments. While the zero-shot design precludes task-specific fine-tuning by definition, we can include additional details on how the concept extractor outputs are formatted as input to support reproducibility. No derivation of correctness is claimed beyond the model's documented reasoning capabilities.
revision: yes
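The multi-run consistency observations promised in this response could be summarized as the fraction of runs in which each directed edge appears. The sketch below is one hypothetical way to do that: the function names, the 0.8 threshold, and the four example runs are assumptions for illustration, not anything reported by the authors.

```python
from collections import Counter


def edge_stability(runs):
    """Fraction of runs in which each directed edge appears."""
    counts = Counter(edge for run in runs for edge in set(run))
    return {edge: count / len(runs) for edge, count in counts.items()}


def stable_edges(runs, threshold=0.8):
    """Edges whose appearance frequency meets the (assumed) threshold."""
    return {e for e, freq in edge_stability(runs).items() if freq >= threshold}


# Hypothetical edge lists from four repeated queries to a reasoning model.
runs = [
    [("weather", "ground"), ("weather", "sky")],
    [("weather", "ground"), ("sky", "weather")],
    [("weather", "ground"), ("weather", "sky")],
    [("weather", "ground")],
]

stability = edge_stability(runs)
print(stability)
print(stable_edges(runs))
```

High stability shows only that the model is self-consistent; it does not show that the edges are correct, which still requires a ground-truth comparison.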

standing simulated objections not resolved

Substantiating zero-shot causal discovery with quantitative metrics (e.g., edge precision/recall on synthetic datasets with known ground-truth DAGs), as no such benchmarks or datasets were included in the original manuscript.

Circularity Check

0 steps flagged

No circularity: framework composes external pretrained models without self-referential reductions

full rationale

The paper defines FM-CGM as a modular pipeline of three components (concept extractor, manipulator, counterfactual generator) that directly invoke off-the-shelf large reasoning models and text-to-image diffusion models for zero-shot causal tasks. No equations appear that equate outputs to inputs by construction, no parameters are fitted on subsets and relabeled as predictions, and no self-citations are used to establish uniqueness or load-bearing premises. CSG is introduced as a new cross-attention mechanism rather than derived from prior author work. The derivation chain therefore remains self-contained against external model capabilities and does not reduce to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that foundation models already encode usable causal knowledge in zero-shot mode; no free parameters or invented entities with independent evidence are described.

axioms (1)

domain assumption: Pretrained foundation models possess zero-shot causal reasoning capabilities suitable for inference and intervention.
Directly invoked when stating that a large reasoning model is leveraged for causal inference without additional training.

invented entities (2)

FM-CGM framework (no independent evidence)
purpose: To provide a modular end-to-end pipeline for visual causal reasoning
Newly introduced named system composed of three components.

Causal Semantic Guidance (CSG) (no independent evidence)
purpose: To ensure semantic interventions affect descendant concepts while preserving invariants
Newly proposed cross-attention mechanism.

pith-pipeline@v0.9.0 · 5682 in / 1119 out tokens · 33836 ms · 2026-05-25T04:36:54.845521+00:00 · methodology

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

[1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. 2025. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025).
[2] Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, and Apolinario Passos. 2024. LEDITS++: Limitless Image Editing using Text-to-Image Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems.
[4] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In International Conference on Learning Representations.
[5] Murat Kocaoglu, Christopher Snyder, Alexandros G. Dimakis, and Sriram Vishwanath. 2018. CausalGAN: Learning Causal Implicit Generative Models with Adversarial Training. In International Conference on Learning Representations.
[6] Aneesh Komanduri, Karuna Bhaila, and Xintao Wu. 2025. CausalVLBench: Benchmarking Visual Causal Reasoning in Large Vision-Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
[7] Aneesh Komanduri, Xintao Wu, Yongkai Wu, and Feng Chen. 2024. From Identifiable Causal Representations to Controllable Counterfactual Generation: A Survey on Causal Generative Modeling. Transactions on Machine Learning Research (2024).
[8] Aneesh Komanduri, Yongkai Wu, Feng Chen, and Xintao Wu. 2024. Learning Causally Disentangled Representations via the Principle of Independent Causal Mechanisms. In Proceedings of the 33rd International Joint Conference on Artificial Intelligence.
[9] Aneesh Komanduri, Chen Zhao, Feng Chen, and Xintao Wu. 2024. Causal Diffusion Autoencoders: Toward Counterfactual Generation via Diffusion Probabilistic Models. In Proceedings of the 27th European Conference on Artificial Intelligence.
[10] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision.
[11] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. 2025. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research 22, 4 (2025), 730–751.
[12] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. 2024. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence.
[13] Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved Denoising Diffusion Probabilistic Models. In Proceedings of the 38th International Conference on Machine Learning.
[14] Nick Pawlowski, Daniel Coelho de Castro, and Ben Glocker. 2020. Deep Structural Causal Models for Tractable Counterfactual Inference. In Advances in Neural Information Processing Systems.
[15] Judea Pearl. 2009. Causality (2 ed.). Cambridge University Press, Cambridge, UK.
[16] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2024. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In The Twelfth International Conference on Learning Representations.
[17] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning.
[18] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. In Proceedings of the 38th International Conference on Machine Learning. 8821–8831.
[19] Fabio De Sousa Ribeiro, Ainkaran Santhirasekaram, and Ben Glocker. 2025. Counterfactual Identifiability via Dynamic Optimal Transport. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
[20] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Pedro Sanchez and Sotirios A Tsaftaris. 2022. Diffusion causal models for counterfactual estimation. Conference on Causal Learning and Reasoning (CLeaR).
[22] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. 2021. Toward Causal Representation Learning. Proc. IEEE 109 (May 2021), 612–634.
[23] Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021. Denoising Diffusion Implicit Models. In International Conference on Learning Representations.
[24] Nikos Spyrou, Athanasios Vlontzos, Paraskevas Pegios, Thomas Melistas, Nefeli Gkouti, Yannis Panagakis, Giorgos Papanastasiou, and Sotirios A Tsaftaris. 2025. Causally steered diffusion for automated video counterfactual generation. arXiv preprint arXiv:2506.14404 (2025).
[25] Aniket Vashishtha, Abbavaram Gowtham Reddy, Abhinav Kumar, Saketh Bachu, Vineeth N. Balasubramanian, and Amit Sharma. 2023. Causal Inference using LLM-Guided Discovery. In AAAI 2024 Workshop on "Are Large Language Models Simply Causal Parrots?"
[26] Vishal Verma, Sawal Acharya, Devansh Bhardwaj, Samuel Simko, Yongjin Yang, Anahita Haghighat, Dominik Janzing, Mrinmaya Sachan, Bernhard Schölkopf, and Zhijing Jin. 2025. Causal AI Scientist: Facilitating Causal Data Science with Large Language Models. In NeurIPS 2025 Workshop on CauScien: Uncovering Causality in Science.
[27] Zihao Wang, Lin Gui, Jeffrey Negrea, and Victor Veitch. 2023. Concept Algebra for (Score-Based) Text-Controlled Generative Models. In Advances in Neural Information Processing Systems.
[28] Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, and Shiyu Chang. 2023. Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 1900–1910. doi:10.1109/CVPR52729.2023.00189
[29] Tao Yang, Yuwang Wang, Yan Lu, and Nanning Zheng. 2023. DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems.
[30] Matej Zečević, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. 2023. Causal Parrots: Large Language Models May Talk Causality But Are Not Causal. Transactions on Machine Learning Research (2023).
[31] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
