Diff-CA: Separating Common and Salient Factors with Diffusion Models

Alasdair Newson; Alexandre Fournier Montgieux; Micha\"el Soumm; Pietro Gori; Yunlong He

arxiv: 2606.06120 · v1 · pith:GQGBPXYGnew · submitted 2026-06-04 · 💻 cs.CV

Diff-CA: Separating Common and Salient Factors with Diffusion Models

Micha\"el Soumm , Alexandre Fournier Montgieux , Yunlong He , Pietro Gori , Alasdair Newson This is my paper

Pith reviewed 2026-06-28 01:45 UTC · model grok-4.3

classification 💻 cs.CV

keywords contrastive analysisdiffusion modelsfactor separationimage generationidentifiabilityweak supervisioncommon factorssalient factors

0 comments

The pith

Diffusion models separate common and salient factors in images via identifiable additive factorization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a conditioning framework for diffusion models to separate factors shared between two image distributions from those unique to one distribution. It first trains a prompt-free image-conditioned diffusion model and then decomposes the conditioning signal into common and salient parts with weak supervision. This avoids the reconstruction and quality problems that limited earlier methods based on VAEs and GANs. The authors prove the additive contrastive factorization is identifiable under mild conditions, which supports reliable separation and operations that edit only the salient factor.

Core claim

By training a prompt-free image-conditioned diffusion model and decomposing its conditioning into common and salient factors using weak supervision, contrastive decomposition is performed while preserving generation quality. The additive contrastive factorization commonly assumed in prior work is proven identifiable under mild conditions. This factorization enables targeted operations such as swapping or interpolating only the salient factor.

What carries the argument

Conditioning decomposition in an image-conditioned diffusion model that isolates common and salient factors, with the identifiability of the additive contrastive factorization.

If this is right

High-fidelity image generation and edition remain possible during contrastive decomposition.
Targeted edits are achieved by swapping or interpolating only the salient factor.
The approach extends contrastive analysis to domains where prior generative methods were limited by image quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The identifiability result may support similar factorizations in conditioning mechanisms of other generative models.
The weak supervision step for decomposition could be tested with stronger or weaker signals to measure robustness.
The framework suggests applications in editing tasks that require changing only distribution-specific attributes while preserving shared structure.

Load-bearing premise

The data must follow an additive contrastive factorization model and the mild conditions for identifiability must hold in the image domains tested.

What would settle it

A controlled pair of image distributions where multiple distinct decompositions into common and salient factors produce identical observed data would falsify the identifiability claim.

Figures

Figures reproduced from arXiv: 2606.06120 by Alasdair Newson, Alexandre Fournier Montgieux, Micha\"el Soumm, Pietro Gori, Yunlong He.

**Figure 1.** Figure 1: Contrastive Analysis separates common from salient factors between two unpaired data distributions using only weak binary supervision (i.e., dataset-level). The learned common and salient latent spaces can be manipulated for editing or knowledge discovery, for instance. based on complex, salient patterns that can be difficult to capture using text prompts or that are not necessarily known in advance. The i… view at source ↗

**Figure 2.** Figure 2: Conceptual view of our decomposition. Factors (S, C) generate an observed image X. We learn a feature extractor fθ that projects the images into latents Z, and an encoder EθE that splits Z, into (Zˆ S,ZˆC ) aimed at representing (S, C). (Zˆ S,ZˆC ) sum to Zˆ, which conditions a diffusion model Gψ aimed at generating an approximation Xˆ of X. In this section, we introduce the additive structural assumptio… view at source ↗

**Figure 3.** Figure 3: Our 2-Stage training protocol. Left: Conditioning pipeline. An input image x1 is mapped to a latent u0 by a VAE encoder and to DINOv3 features (a). A cross-query module (b) produces semantic tokens T, and a small CNN (c) produces a color token tcol. The concatenated tokens Z condition a U-Net diffusion model via cross-attention (d). Right: Training pipeline of our CommonSalient encoder. A CLS token is pre… view at source ↗

**Figure 4.** Figure 4: Left: Comparison of image reconstruction on CelebA-HQ 256×256. This serves as a sanity check to make sure that the images produced by our model are of good quality. Our methods (CQ and CQC) present competitive reconstructions compared to DINOv3 with fewer tokens, while also being able to carry out editing and image control (which DINOv3 cannot). Right: Exploring the Z-space. We linearly interpolate between… view at source ↗

**Figure 5.** Figure 5: For each dataset, the first row represents two real images from background and target while [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Salient interpolation: We fix the common ZˆC of the left image and interpolate the salient Zˆ S with the salient of the right image. Top: The style of the glasses is progressively transferred. Bottom: cat features (ears, whiskers) progressively appear. 2 0 2 PCA Dimension 1 1.0 0.5 0.0 0.5 PCA Dimension 2 PCA of Salient Vectors 0 10 UMAP Dimension 1 2.5 0.0 2.5 5.0 7.5 10.0 UMAP Dimension 2 UMAP of Salient… view at source ↗

**Figure 7.** Figure 7: PCA and UMAP projection of Zˆ S. Even though Zˆ S was only learned using weak binary supervision, fine-grained subclasses are still separated. Finally, [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Failure cases of some prompt-based diffusion editing methods. Left: FLUX.1 Kontext adds the same generic sunglasses to everyone. Middle: Nano Banana Pro correctly removes the tumor but fails to generate an anatomically realistic image (the generated brain gyrification is not realistic), and it alters the healthy anatomy (ventricles in the bottom image). Right: Nano Banana Pro fails to swap head position wi… view at source ↗

**Figure 9.** Figure 9: Failure cases of Nano Banana 2 (Gemini 3 Flash Image). Left: We use a single real image and a prompt to guide the editing. All generated images are either anatomically unrealistic or show altered anatomy (orange arrows). Right: Two real images (without and with a tumor) are provided as input, along with a prompt to guide the editing. This setup more closely resembles the goal of Contrastive Analysis (CA), … view at source ↗

**Figure 10.** Figure 10: Failure cases of Nano Banana Pro for a head position transfer for two different prompts, [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: Diagram of our Cross-query extractor module [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

**Figure 12.** Figure 12: Reconstruction from random noise for different conditioning encoders. Our conditioning [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗

**Figure 13.** Figure 13: Principal directions of the Z-space. The first one corresponds to head position, the second one to gender. • Interpolation [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗

**Figure 14.** Figure 14: Visual inspection of the BraTS 2023 autoencoding using [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

**Figure 15.** Figure 15: Cross-attention maps for some conditioning tokens, comparing our conditioning to raw [PITH_FULL_IMAGE:figures/full_fig_p026_15.png] view at source ↗

**Figure 16.** Figure 16: Comparing reconstruction and swapping between Diff-CA (ours) and CA baselines. The [PITH_FULL_IMAGE:figures/full_fig_p027_16.png] view at source ↗

**Figure 17.** Figure 17: Dependence of the generated images on different starting noises. [PITH_FULL_IMAGE:figures/full_fig_p027_17.png] view at source ↗

**Figure 18.** Figure 18: Effect of the color token. Top row: original images. Middle row: reconstruction using T and tcol. Bottom row: we permute the color tokens tcol between the 4 images and reconstruct from this modified conditioning. The color token tcol contains color histogram information. Encoder Steps Cond. Size PSNR↑ SSIM↑ LPIPS↓ DISTS↓ FID↓ FD-DINO↓ KID↓ ID-Sim↑ DiffAE 20 512 19.09 0.530 0.209 0.173 14.28 32.89 0.013 0.… view at source ↗

**Figure 19.** Figure 19: Training dynamics of the self-tuning GRL parameter [PITH_FULL_IMAGE:figures/full_fig_p030_19.png] view at source ↗

**Figure 20.** Figure 20: Training adversarial accuracy using our self-tuning GRL schedule. After a chaotic warmup [PITH_FULL_IMAGE:figures/full_fig_p030_20.png] view at source ↗

**Figure 21.** Figure 21: Example of swapping with a model trained [PITH_FULL_IMAGE:figures/full_fig_p031_21.png] view at source ↗

**Figure 22.** Figure 22: Additional swapping results on glasses in FFHQ using Diff-CA. [PITH_FULL_IMAGE:figures/full_fig_p034_22.png] view at source ↗

**Figure 23.** Figure 23: Additional swapping results on the head position in FFHQ using Diff-CA. [PITH_FULL_IMAGE:figures/full_fig_p035_23.png] view at source ↗

**Figure 24.** Figure 24: Additional swapping results on the cats/togs classes in AFHQ using Diff-CA. [PITH_FULL_IMAGE:figures/full_fig_p035_24.png] view at source ↗

**Figure 25.** Figure 25: Additional swapping and interpolation results on the BraTS 2023 dataset using Diff-CA. [PITH_FULL_IMAGE:figures/full_fig_p036_25.png] view at source ↗

read the original abstract

Contrastive Analysis aims to separate factors that are common between two data distributions from those that are salient to only one of them. Existing contrastive methods are based on generative models (e.g., VAEs or GANs) that often suffer from limited reconstruction and image quality, which hampers effective latent factor separation and limits their applicability to high-fidelity image generation and edition. We propose a novel conditioning framework for diffusion models that enables contrastive decomposition without compromising generation quality. We first train a prompt-free, image-conditioned diffusion model, and then learn to decompose the conditioning into a common and a salient factor, using weak supervision. We prove that the additive contrastive factorization, commonly assumed in prior work, is identifiable under mild conditions. This factorization enables targeted operations by swapping or interpolating only the salient factor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Diff-CA brings diffusion conditioning to contrastive factor separation plus an identifiability proof, but the proof details and weak supervision need checking to judge how far it goes.

read the letter

The paper's main move is to train a prompt-free image-conditioned diffusion model first, then decompose its conditioning into common and salient factors with weak supervision. This keeps the high-fidelity generation that VAEs and GANs usually lose in contrastive analysis work. They also claim a proof that the additive contrastive factorization is identifiable under mild conditions.

What is actually new is the conditioning framework itself and the fact that the identifiability result is stated for diffusion models rather than the earlier generative setups cited in the abstract. The practical side is straightforward: once the split is learned, you can swap or interpolate only the salient factor for editing tasks.

The paper does a clean job on the motivation and on showing why diffusion quality matters for this kind of decomposition. The editing operations follow directly from the factorization.

Soft spots are the usual ones for a methods paper with a proof claim. The mild conditions are not spelled out at the abstract level, so it matters whether they are realistic for image data or whether they hide stronger assumptions. The weak supervision step is mentioned but its exact form, label requirements, and sensitivity are not visible here; that could affect how reproducible the decomposition is. If the experiments stay mostly qualitative, that would be a minor gap rather than a fatal one.

This is for people already working on disentanglement or high-quality image editing in computer vision. A reader who cares about diffusion conditioning tricks or contrastive generative models would get the most out of it.

It deserves peer review. The core idea is coherent and the claims are checkable with the right experiments and derivation.

Referee Report

2 major / 0 minor

Summary. The paper proposes Diff-CA, a conditioning framework for diffusion models to perform contrastive analysis by separating common factors (shared across distributions) from salient factors (unique to one). It first trains a prompt-free image-conditioned diffusion model, then decomposes the conditioning into common and salient factors via weak supervision. The central claim is a proof that the additive contrastive factorization (commonly assumed in prior work) is identifiable under mild conditions, enabling targeted editing operations such as swapping or interpolating only the salient factor while preserving generation quality.

Significance. If the identifiability result holds under the stated mild conditions and the weak supervision details are made explicit with supporting experiments, the work would meaningfully advance contrastive analysis by integrating it with high-fidelity diffusion models, overcoming the reconstruction and quality limitations of prior VAE- and GAN-based approaches.

major comments (2)

[Abstract] Abstract: the claim that 'we prove that the additive contrastive factorization... is identifiable under mild conditions' is presented without the derivation, explicit statement of the mild conditions, error analysis, or experimental validation that those conditions hold in the tested image domains; this renders the central claim unverifiable from the provided text.
The weak supervision procedure for decomposing the conditioning into common and salient factors is described at a high level but the precise form of supervision, loss terms, and how it interacts with the diffusion training objective are unspecified, which is load-bearing for reproducibility and for confirming that the factorization remains identifiable in practice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below, clarifying the location of the proof and details in the manuscript while committing to revisions for improved clarity and reproducibility.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'we prove that the additive contrastive factorization... is identifiable under mild conditions' is presented without the derivation, explicit statement of the mild conditions, error analysis, or experimental validation that those conditions hold in the tested image domains; this renders the central claim unverifiable from the provided text.

Authors: The full derivation of the identifiability result, including the explicit mild conditions (additive factorization with weak supervision on paired distributions), is provided in Section 3, along with the complete proof. Experimental validation that the conditions hold for the tested image domains, including quantitative checks, appears in Section 4. We agree the abstract is too concise to include these elements. In revision we will expand the abstract to state the mild conditions explicitly, while keeping error analysis in the experiments section due to space limits. revision: yes
Referee: The weak supervision procedure for decomposing the conditioning into common and salient factors is described at a high level but the precise form of supervision, loss terms, and how it interacts with the diffusion training objective are unspecified, which is load-bearing for reproducibility and for confirming that the factorization remains identifiable in practice.

Authors: We agree the current description is high-level. Section 3.2 specifies the weak supervision (using weak labels on paired samples indicating shared vs. unique factors), the exact loss terms (contrastive decomposition losses added to the diffusion denoising objective), and their interaction. To improve reproducibility we will revise Section 3.2 to include the full loss equations, hyperparameter settings, and additional pseudocode showing how the decomposition is trained jointly with the diffusion model. revision: yes

Circularity Check

0 steps flagged

No significant circularity: identifiability result presented as independent proof

full rationale

The paper's central claim is a mathematical proof that the additive contrastive factorization is identifiable under mild conditions. This is described as independent of the training procedure and not derived from fitted parameters or self-citations. No load-bearing steps reduce by construction to inputs, self-definitions, or author-specific uniqueness theorems. The derivation is self-contained as a standard identifiability argument.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; full technical details unavailable.

axioms (1)

domain assumption Data follows an additive contrastive factorization into common and salient factors
Stated as commonly assumed in prior work and proved identifiable under mild conditions.

pith-pipeline@v0.9.1-grok · 5679 in / 1074 out tokens · 28808 ms · 2026-06-28T01:45:00.477025+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 10 canonical work pages · 6 internal anchors

[1]

Contrastive Variational Autoencoder Enhances Salient Features

Abubakar Abid and James Zou. Contrastive variational autoencoder enhances salient features.arXiv preprint arXiv:1902.04601, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902
[2]

Exploring patterns enriched in a dataset with contrastive principal component analysis.Nature communications, 9(1):2134, 2018

Abubakar Abid, Martin J Zhang, Vivek K Bagaria, and James Zou. Exploring patterns enriched in a dataset with contrastive principal component analysis.Nature communications, 9(1):2134, 2018

2018
[3]

Domain intersection and domain difference

Sagie Benaim, Michael Khaitov, Tomer Galanti, and Lior Wolf. Domain intersection and domain difference. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3453–3462, 2019

2019
[4]

Domain separation networks

Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. InAdvances in Neural Information Processing Systems, 2016

2016
[5]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

2023
[6]

Double infogan for contrastive analysis

Florence Carton, Robin Louiset, and Pietro Gori. Double infogan for contrastive analysis. InInternational Conference on Artificial Intelligence and Statistics, pages 172–180. PMLR, 2024. 10

2024
[7]

Infogan: Interpretable representation learning by information maximizing generative adversarial nets.Advances in neural information processing systems, 29, 2016

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets.Advances in neural information processing systems, 29, 2016

2016
[8]

Stargan v2: Diverse image synthesis for multiple domains

Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020

2020
[9]

In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4685–4694, 2019. doi: 10.1109/CVPR.2019.00482

work page doi:10.1109/cvpr.2019.00482 2019
[10]

Lightweight face recognition challenge

Jiankang Deng, Jia Guo, Debing Zhang, Yafeng Deng, Xiangju Lu, and Song Shi. Lightweight face recognition challenge. In2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 2638–2646, 2019. doi: 10.1109/ICCVW.2019.00322

work page doi:10.1109/iccvw.2019.00322 2019
[11]

Integrating Prior Knowledge in Contrastive Learning with Kernel

Benoit Dufumier, Carlo Alberto Barbano, Robin Louiset, Edouard Duchesnay, and Pietro Gori. Integrating Prior Knowledge in Contrastive Learning with Kernel. InInternational Conference on Machine Learning (ICML), 2023

2023
[12]

What to align in multimodal contrastive learning? InInternational Conference on Learning Representations (ICLR), 2025

Benoit Dufumier, Javiera Castillo-Navarro, Devis Tuia, and Jean-Philippe Thiran. What to align in multimodal contrastive learning? InInternational Conference on Learning Representations (ICLR), 2025

2025
[13]

Learning robust represen- tations via multi-view information bottleneck

Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learning robust represen- tations via multi-view information bottleneck. InInternational Conference on Learning Representations, 2020

2020
[14]

Unsupervised domain adaptation by backpropagation

Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. InInterna- tional conference on machine learning, pages 1180–1189. PMLR, 2015

2015
[15]

Image-to-image translation for cross- domain disentanglement

Abel Gonzalez-Garcia, Joost van de Weijer, and Yoshua Bengio. Image-to-image translation for cross- domain disentanglement. InAdvances in Neural Information Processing Systems, 2018

2018
[16]

Nano banana pro: Advanced visual reasoning and editing with gemini 3 pro im- age

Google DeepMind. Nano banana pro: Advanced visual reasoning and editing with gemini 3 pro im- age. Technical report, Google, November 2025. URL https://ai.google.dev/gemini-api/docs/ image-generation

2025
[17]

Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

2020
[18]

Saga: Learning signal-aligned distributions for improved text-to-image generation, 2025

Paul Grimal, Michaël Soumm, Hervé Le Borgne, Olivier Ferret, and Akihiro Sugimoto. Saga: Learning signal-aligned distributions for improved text-to-image generation, 2025. URL https://arxiv.org/ abs/2508.13866

work page arXiv 2025
[19]

Initno: Boosting text-to-image diffusion models via initial noise optimization

Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, and Di Huang. Initno: Boosting text-to-image diffusion models via initial noise optimization. InCVPR, 2024

2024
[20]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016
[21]

Learning Common and Salient Generative Factors Between Two Image Datasets, 2025

Yunlong He, Gwilherm Lesné, Ziqian Liu, Michaël Soumm, and Pietro Gori. Learning Common and Salient Generative Factors Between Two Image Datasets, 2025

2025
[22]

Prompt-to- prompt image editing with cross-attention control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-or. Prompt-to- prompt image editing with cross-attention control. InThe Eleventh International Conference on Learning Representations, 2023

2023
[23]

beta-vae: Learning basic visual concepts with a constrained variational framework

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. InInternational conference on learning representations, 2017

2017
[24]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020
[25]

Perceiver IO: A General Architecture for Structured Inputs & Outputs

Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver io: A general architecture for structured inputs & outputs.arXiv preprint arXiv:2107.14795, 2021. 11

work page internal anchor Pith review Pith/arXiv arXiv 2021
[26]

Training-free content injection using h-space in diffusion models

Jaeseok Jeong, Mingi Kwon, and Youngjung Uh. Training-free content injection using h-space in diffusion models. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5151–5161, 2024

2024
[27]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019

2019
[28]

The brain tumor segmentation (brats) challenge 2023: Focus on pediatrics (cbtn-connect-dipgr-asnr-miccai brats-peds)

Anahita Fathi Kazerooni, Nastaran Khalili, Xinyang Liu, Debanjan Haldar, Zhifan Jiang, Syed Muhammed Anwar, Jake Albrecht, Maruf Adewole, Udunna Anazodo, Hannah Anderson, et al. The brain tumor segmentation (brats) challenge 2023: Focus on pediatrics (cbtn-connect-dipgr-asnr-miccai brats-peds). ArXiv, pages arXiv–2305, 2024

2023
[29]

Michael Kleinman, Alessandro Achille, Stefano Soatto, and Jonathan C. Kao. Gács–körner common information variational autoencoder. InAdvances in Neural Information Processing Systems, 2023

2023
[30]

Diffusion models already have a semantic latent space

Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. Diffusion models already have a semantic latent space. InThe Eleventh International Conference on Learning Representations, 2023

2023
[31]

Dranet: Disentangling representation and adaptation networks for unsupervised cross-domain adaptation

Seunghun Lee, Sunghyun Cho, and Sunghoon Im. Dranet: Disentangling representation and adaptation networks for unsupervised cross-domain adaptation. InCVPR, pages 15252–15261, 2021

2021
[32]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

Deep learning face attributes in the wild

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015

2015
[34]

Challenging common assumptions in the unsupervised learning of disentangled representations

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. Ininternational conference on machine learning, pages 4114–4124. PMLR, 2019

2019
[35]

UCSL : A Machine Learning Expectation-Maximization Framework for Unsupervised Clustering Driven by Supervised Learning

Robin Louiset, Pietro Gori, Benoit Dufumier, Josselin Houenou, Antoine Grigis, and Edouard Duchesnay. UCSL : A Machine Learning Expectation-Maximization Framework for Unsupervised Clustering Driven by Supervised Learning. InMachine Learning and Knowledge Discovery in Databases. Research Track, 2021

2021
[36]

Sepvae: a contrastive vae to separate pathological patterns from healthy ones

Robin Louiset, Edouard Duchesnay, Grigis Antoine, Benoit Dufumier, and Pietro Gori. Sepvae: a contrastive vae to separate pathological patterns from healthy ones. In Ninon Burgos, Caroline Petitjean, Maria Vakalopoulou, Stergios Christodoulidis, Pierrick Coupe, Hervé Delingette, Carole Lartizien, and Diana Mateus, editors,Proceedings of The 7nd Internatio...
[37]

URLhttps://proceedings.mlr.press/v250/louiset24a.html
[38]

Separating common from salient patterns with contrastive representation learning

Robin Louiset, Edouard Duchesnay, Antoine Grigis, and Pietro Gori. Separating common from salient patterns with contrastive representation learning. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=30N3bNAiw3

2024
[39]

Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls.Data Mining and Knowledge Discovery, 2026

Robin Louiset, Edouard Duchesnay, Benoit Dufumier, Antoine Grigis, and Pietro Gori. Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls.Data Mining and Knowledge Discovery, 2026

2026
[40]

Null-text inversion for editing real images using guided diffusion models

Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6038–6047, 2023

2023
[41]

Understanding the latent space of diffusion models through the lens of riemannian geometry.Advances in Neural Information Processing Systems, 36:24129–24142, 2023

Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, and Youngjung Uh. Understanding the latent space of diffusion models through the lens of riemannian geometry.Advances in Neural Information Processing Systems, 36:24129–24142, 2023

2023
[42]

Zero-shot image-to-image translation

Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. InACM SIGGRAPH 2023 conference proceedings, pages 1–11, 2023

2023
[43]

Diffusion autoencoders: Toward a meaningful and decodable representation

Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10619–10629, 2022. 12

2022
[44]

Shared Independent Component Analysis for Multi-Subject Neuroimaging

Hugo Richard, Pierre Ablin, Bertrand Thirion, Alexandre Gramfort, and Aapo Hyvarinen. Shared Independent Component Analysis for Multi-Subject Neuroimaging. InNeurIPS, volume 34, pages 29962–29971, 2021

2021
[45]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

2022
[46]

Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023

2023
[47]

Learning disentangled representations via mutual information estimation

Eduardo Hugo Sánchez, Mathieu Serrurier, and Mathias Ortner. Learning disentangled representations via mutual information estimation. InEuropean Conference on Computer Vision (ECCV), 2020

2020
[48]

Color alignment in diffusion

Ka Chun Shum, Binh-Son Hua, Duc Thanh Nguyen, and Sai-Kit Yeung. Color alignment in diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28446–28455, 2025

2025
[49]

DINOv3

Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Contrastive multiview coding

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. InECCV, pages 776–794, 2020

2020
[51]

Improving and generalizing flow-based generative models with minibatch optimal transport.Transactions on Machine Learning Research, 2024

Alexander Tong, Kilian FATRAS, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport.Transactions on Machine Learning Research, 2024. ISSN 2835-8856

2024
[52]

Deep variational canonical correlation analysis

Weiran Wang, Xinchen Yan, Honglak Lee, and Karen Livescu. Deep variational canonical correlation analysis. InInternational Conference on Learning Representations (ICLR), 2017

2017
[53]

Moment matching deep contrastive latent variable models.arXiv preprint arXiv:2202.10560, 2022

Ethan Weinberger, Nicasia Beebe-Wang, and Su-In Lee. Moment matching deep contrastive latent variable models.arXiv preprint arXiv:2202.10560, 2022

work page arXiv 2022
[54]

Diffusion model with cross attention as an inductive bias for disentanglement.Advances in Neural Information Processing Systems, 37:82465–82492, 2024

Tao Yang, Cuiling Lan, Yan Lu, and Nanning Zheng. Diffusion model with cross attention as an inductive bias for disentanglement.Advances in Neural Information Processing Systems, 37:82465–82492, 2024

2024
[55]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

Anti-exposure bias in diffusion models

Junyu Zhang, Daochang Liu, Eunbyung Park, Shichao Zhang, and Chang Xu. Anti-exposure bias in diffusion models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=MtDd7rWok1

2025
[57]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

2023
[58]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders.arXiv preprint arXiv:2510.11690, 2025. 13 Appendix A: Broader Impact Our method has the potential to significantly impact domains such as medical imaging, where diffusion-based models combined with contrastive analysis could help identify subtl...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

eyes and mouth

The learning rate is warmed up from 10−6 to 10−4 for 5000 steps, then kept constant at 10−4 for the rest of the training. All models are trained in mixed precision with bfloat16. Training was performed on a single NVIDIA H100 GPU, equipped with an Intel Xeon Platinum 8468 CPU with 24 active cores (workers). Training takes about 48 hours on this configurat...

2032

[1] [1]

Contrastive Variational Autoencoder Enhances Salient Features

Abubakar Abid and James Zou. Contrastive variational autoencoder enhances salient features.arXiv preprint arXiv:1902.04601, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902

[2] [2]

Exploring patterns enriched in a dataset with contrastive principal component analysis.Nature communications, 9(1):2134, 2018

Abubakar Abid, Martin J Zhang, Vivek K Bagaria, and James Zou. Exploring patterns enriched in a dataset with contrastive principal component analysis.Nature communications, 9(1):2134, 2018

2018

[3] [3]

Domain intersection and domain difference

Sagie Benaim, Michael Khaitov, Tomer Galanti, and Lior Wolf. Domain intersection and domain difference. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3453–3462, 2019

2019

[4] [4]

Domain separation networks

Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. InAdvances in Neural Information Processing Systems, 2016

2016

[5] [5]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

2023

[6] [6]

Double infogan for contrastive analysis

Florence Carton, Robin Louiset, and Pietro Gori. Double infogan for contrastive analysis. InInternational Conference on Artificial Intelligence and Statistics, pages 172–180. PMLR, 2024. 10

2024

[7] [7]

Infogan: Interpretable representation learning by information maximizing generative adversarial nets.Advances in neural information processing systems, 29, 2016

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets.Advances in neural information processing systems, 29, 2016

2016

[8] [8]

Stargan v2: Diverse image synthesis for multiple domains

Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020

2020

[9] [9]

In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4685–4694, 2019. doi: 10.1109/CVPR.2019.00482

work page doi:10.1109/cvpr.2019.00482 2019

[10] [10]

Lightweight face recognition challenge

Jiankang Deng, Jia Guo, Debing Zhang, Yafeng Deng, Xiangju Lu, and Song Shi. Lightweight face recognition challenge. In2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 2638–2646, 2019. doi: 10.1109/ICCVW.2019.00322

work page doi:10.1109/iccvw.2019.00322 2019

[11] [11]

Integrating Prior Knowledge in Contrastive Learning with Kernel

Benoit Dufumier, Carlo Alberto Barbano, Robin Louiset, Edouard Duchesnay, and Pietro Gori. Integrating Prior Knowledge in Contrastive Learning with Kernel. InInternational Conference on Machine Learning (ICML), 2023

2023

[12] [12]

What to align in multimodal contrastive learning? InInternational Conference on Learning Representations (ICLR), 2025

Benoit Dufumier, Javiera Castillo-Navarro, Devis Tuia, and Jean-Philippe Thiran. What to align in multimodal contrastive learning? InInternational Conference on Learning Representations (ICLR), 2025

2025

[13] [13]

Learning robust represen- tations via multi-view information bottleneck

Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learning robust represen- tations via multi-view information bottleneck. InInternational Conference on Learning Representations, 2020

2020

[14] [14]

Unsupervised domain adaptation by backpropagation

Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. InInterna- tional conference on machine learning, pages 1180–1189. PMLR, 2015

2015

[15] [15]

Image-to-image translation for cross- domain disentanglement

Abel Gonzalez-Garcia, Joost van de Weijer, and Yoshua Bengio. Image-to-image translation for cross- domain disentanglement. InAdvances in Neural Information Processing Systems, 2018

2018

[16] [16]

Nano banana pro: Advanced visual reasoning and editing with gemini 3 pro im- age

Google DeepMind. Nano banana pro: Advanced visual reasoning and editing with gemini 3 pro im- age. Technical report, Google, November 2025. URL https://ai.google.dev/gemini-api/docs/ image-generation

2025

[17] [17]

Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

2020

[18] [18]

Saga: Learning signal-aligned distributions for improved text-to-image generation, 2025

Paul Grimal, Michaël Soumm, Hervé Le Borgne, Olivier Ferret, and Akihiro Sugimoto. Saga: Learning signal-aligned distributions for improved text-to-image generation, 2025. URL https://arxiv.org/ abs/2508.13866

work page arXiv 2025

[19] [19]

Initno: Boosting text-to-image diffusion models via initial noise optimization

Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, and Di Huang. Initno: Boosting text-to-image diffusion models via initial noise optimization. InCVPR, 2024

2024

[20] [20]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016

[21] [21]

Learning Common and Salient Generative Factors Between Two Image Datasets, 2025

Yunlong He, Gwilherm Lesné, Ziqian Liu, Michaël Soumm, and Pietro Gori. Learning Common and Salient Generative Factors Between Two Image Datasets, 2025

2025

[22] [22]

Prompt-to- prompt image editing with cross-attention control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-or. Prompt-to- prompt image editing with cross-attention control. InThe Eleventh International Conference on Learning Representations, 2023

2023

[23] [23]

beta-vae: Learning basic visual concepts with a constrained variational framework

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. InInternational conference on learning representations, 2017

2017

[24] [24]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020

[25] [25]

Perceiver IO: A General Architecture for Structured Inputs & Outputs

Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver io: A general architecture for structured inputs & outputs.arXiv preprint arXiv:2107.14795, 2021. 11

work page internal anchor Pith review Pith/arXiv arXiv 2021

[26] [26]

Training-free content injection using h-space in diffusion models

Jaeseok Jeong, Mingi Kwon, and Youngjung Uh. Training-free content injection using h-space in diffusion models. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5151–5161, 2024

2024

[27] [27]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019

2019

[28] [28]

The brain tumor segmentation (brats) challenge 2023: Focus on pediatrics (cbtn-connect-dipgr-asnr-miccai brats-peds)

Anahita Fathi Kazerooni, Nastaran Khalili, Xinyang Liu, Debanjan Haldar, Zhifan Jiang, Syed Muhammed Anwar, Jake Albrecht, Maruf Adewole, Udunna Anazodo, Hannah Anderson, et al. The brain tumor segmentation (brats) challenge 2023: Focus on pediatrics (cbtn-connect-dipgr-asnr-miccai brats-peds). ArXiv, pages arXiv–2305, 2024

2023

[29] [29]

Michael Kleinman, Alessandro Achille, Stefano Soatto, and Jonathan C. Kao. Gács–körner common information variational autoencoder. InAdvances in Neural Information Processing Systems, 2023

2023

[30] [30]

Diffusion models already have a semantic latent space

Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. Diffusion models already have a semantic latent space. InThe Eleventh International Conference on Learning Representations, 2023

2023

[31] [31]

Dranet: Disentangling representation and adaptation networks for unsupervised cross-domain adaptation

Seunghun Lee, Sunghyun Cho, and Sunghoon Im. Dranet: Disentangling representation and adaptation networks for unsupervised cross-domain adaptation. InCVPR, pages 15252–15261, 2021

2021

[32] [32]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[33] [33]

Deep learning face attributes in the wild

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015

2015

[34] [34]

Challenging common assumptions in the unsupervised learning of disentangled representations

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. Ininternational conference on machine learning, pages 4114–4124. PMLR, 2019

2019

[35] [35]

UCSL : A Machine Learning Expectation-Maximization Framework for Unsupervised Clustering Driven by Supervised Learning

Robin Louiset, Pietro Gori, Benoit Dufumier, Josselin Houenou, Antoine Grigis, and Edouard Duchesnay. UCSL : A Machine Learning Expectation-Maximization Framework for Unsupervised Clustering Driven by Supervised Learning. InMachine Learning and Knowledge Discovery in Databases. Research Track, 2021

2021

[36] [36]

Sepvae: a contrastive vae to separate pathological patterns from healthy ones

Robin Louiset, Edouard Duchesnay, Grigis Antoine, Benoit Dufumier, and Pietro Gori. Sepvae: a contrastive vae to separate pathological patterns from healthy ones. In Ninon Burgos, Caroline Petitjean, Maria Vakalopoulou, Stergios Christodoulidis, Pierrick Coupe, Hervé Delingette, Carole Lartizien, and Diana Mateus, editors,Proceedings of The 7nd Internatio...

[37] [37]

URLhttps://proceedings.mlr.press/v250/louiset24a.html

[38] [38]

Separating common from salient patterns with contrastive representation learning

Robin Louiset, Edouard Duchesnay, Antoine Grigis, and Pietro Gori. Separating common from salient patterns with contrastive representation learning. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=30N3bNAiw3

2024

[39] [39]

Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls.Data Mining and Knowledge Discovery, 2026

Robin Louiset, Edouard Duchesnay, Benoit Dufumier, Antoine Grigis, and Pietro Gori. Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls.Data Mining and Knowledge Discovery, 2026

2026

[40] [40]

Null-text inversion for editing real images using guided diffusion models

Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6038–6047, 2023

2023

[41] [41]

Understanding the latent space of diffusion models through the lens of riemannian geometry.Advances in Neural Information Processing Systems, 36:24129–24142, 2023

Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, and Youngjung Uh. Understanding the latent space of diffusion models through the lens of riemannian geometry.Advances in Neural Information Processing Systems, 36:24129–24142, 2023

2023

[42] [42]

Zero-shot image-to-image translation

Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. InACM SIGGRAPH 2023 conference proceedings, pages 1–11, 2023

2023

[43] [43]

Diffusion autoencoders: Toward a meaningful and decodable representation

Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10619–10629, 2022. 12

2022

[44] [44]

Shared Independent Component Analysis for Multi-Subject Neuroimaging

Hugo Richard, Pierre Ablin, Bertrand Thirion, Alexandre Gramfort, and Aapo Hyvarinen. Shared Independent Component Analysis for Multi-Subject Neuroimaging. InNeurIPS, volume 34, pages 29962–29971, 2021

2021

[45] [45]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

2022

[46] [46]

Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023

2023

[47] [47]

Learning disentangled representations via mutual information estimation

Eduardo Hugo Sánchez, Mathieu Serrurier, and Mathias Ortner. Learning disentangled representations via mutual information estimation. InEuropean Conference on Computer Vision (ECCV), 2020

2020

[48] [48]

Color alignment in diffusion

Ka Chun Shum, Binh-Son Hua, Duc Thanh Nguyen, and Sai-Kit Yeung. Color alignment in diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28446–28455, 2025

2025

[49] [49]

DINOv3

Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

Contrastive multiview coding

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. InECCV, pages 776–794, 2020

2020

[51] [51]

Improving and generalizing flow-based generative models with minibatch optimal transport.Transactions on Machine Learning Research, 2024

Alexander Tong, Kilian FATRAS, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport.Transactions on Machine Learning Research, 2024. ISSN 2835-8856

2024

[52] [52]

Deep variational canonical correlation analysis

Weiran Wang, Xinchen Yan, Honglak Lee, and Karen Livescu. Deep variational canonical correlation analysis. InInternational Conference on Learning Representations (ICLR), 2017

2017

[53] [53]

Moment matching deep contrastive latent variable models.arXiv preprint arXiv:2202.10560, 2022

Ethan Weinberger, Nicasia Beebe-Wang, and Su-In Lee. Moment matching deep contrastive latent variable models.arXiv preprint arXiv:2202.10560, 2022

work page arXiv 2022

[54] [54]

Diffusion model with cross attention as an inductive bias for disentanglement.Advances in Neural Information Processing Systems, 37:82465–82492, 2024

Tao Yang, Cuiling Lan, Yan Lu, and Nanning Zheng. Diffusion model with cross attention as an inductive bias for disentanglement.Advances in Neural Information Processing Systems, 37:82465–82492, 2024

2024

[55] [55]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[56] [56]

Anti-exposure bias in diffusion models

Junyu Zhang, Daochang Liu, Eunbyung Park, Shichao Zhang, and Chang Xu. Anti-exposure bias in diffusion models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=MtDd7rWok1

2025

[57] [57]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

2023

[58] [58]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders.arXiv preprint arXiv:2510.11690, 2025. 13 Appendix A: Broader Impact Our method has the potential to significantly impact domains such as medical imaging, where diffusion-based models combined with contrastive analysis could help identify subtl...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [59]

eyes and mouth

The learning rate is warmed up from 10−6 to 10−4 for 5000 steps, then kept constant at 10−4 for the rest of the training. All models are trained in mixed precision with bfloat16. Training was performed on a single NVIDIA H100 GPU, equipped with an Intel Xeon Platinum 8468 CPU with 24 active cores (workers). Training takes about 48 hours on this configurat...

2032