Assessing Sample Quality in Conditional Generation under Compositional Shift

Berker Demirel; Francesco Locatello; Marco Fumero; Theofanis Karaletsos; Valentino Maiorca

arxiv: 2606.09601 · v2 · pith:BWXNIRN7new · submitted 2026-06-08 · 💻 cs.LG

Pith reviewed 2026-06-27 17:19 UTC · model grok-4.3

classification 💻 cs.LG

keywords: conditional generation, sample quality, compositional shift, trust score, faithfulness, realism, biological imaging, generative models

The pith

A per-sample trust score using only training data ranks and filters conditional generations under compositional shift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to evaluate samples from conditional generators when the requested condition combines observed attributes in ways not present in the training data. Standard quality metrics cannot be applied because they require samples from the target distribution, which is unavailable by definition in this extrapolative setting. The authors define a trust score from two quantities that can be estimated from the training distribution alone: global realism, which checks compatibility with the observed data manifold, and attribute-wise faithfulness, which checks whether the sample is closer to the requested attributes than to plausible alternatives. Under a mild coverage condition on the observed attributes, this score produces rankings that support filtering, ranking, and abstention decisions. The approach is shown to work on pretrained models and yields measurable gains in biological imaging and controlled vision tasks.

Core claim

The central claim is that the trust score recovers meaningful comparisons across extrapolated generations under a mild coverage condition on the observed attributes. The score is formed from estimates of global realism and attribute-wise faithfulness, both computable from the training distribution. These comparisons support filtering, ranking, and abstention, and the score applies directly to off-the-shelf pretrained conditional generators. In biological imaging, generations selected by the score preserve real morphological structure better and improve downstream predictive performance, with analogous gains on controlled vision benchmarks. The score can also be used during generation to abstain before full decoding.

What carries the argument

The argument rests on the per-sample trust score T = R + F, which adds global realism R (compatibility with the training data manifold) to attribute-wise faithfulness F (a closer match to the requested attributes than to plausible alternatives).
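To make the construction concrete, here is a minimal sketch of a score of this shape, computed from training features alone. The specific estimators are assumptions chosen for illustration and are not confirmed by this page: realism is taken as a negative squared Mahalanobis distance to the training features, and faithfulness as a centroid-distance margin summed over attributes. The names `fit_reference`, `realism`, `faithfulness`, and `trust` are hypothetical.

```python
import numpy as np

def fit_reference(feats, attrs):
    """Fit global statistics and per-attribute-value centroids on training features.

    feats: (n, d) feature array; attrs: (n, k) integer array, one column per attribute.
    Each attribute needs at least two observed values for the margin below to exist.
    """
    mu = feats.mean(axis=0)
    # Ridge-regularised covariance; the 1e-3 ridge is an arbitrary illustrative choice.
    cov = np.cov(feats, rowvar=False) + 1e-3 * np.eye(feats.shape[1])
    prec = np.linalg.inv(cov)
    centroids = [
        {int(v): feats[attrs[:, j] == v].mean(axis=0) for v in np.unique(attrs[:, j])}
        for j in range(attrs.shape[1])
    ]
    return mu, prec, centroids

def realism(x, mu, prec):
    """R (assumed form): negative squared Mahalanobis distance; higher is better."""
    d = x - mu
    return -float(d @ prec @ d)

def faithfulness(x, requested, centroids):
    """F (assumed form): per attribute, distance to the nearest alternative value
    minus distance to the requested value, summed over attributes."""
    total = 0.0
    for j, value in enumerate(requested):
        d_req = np.linalg.norm(x - centroids[j][value])
        d_alt = min(np.linalg.norm(x - c) for v, c in centroids[j].items() if v != value)
        total += d_alt - d_req
    return float(total)

def trust(x, requested, mu, prec, centroids):
    """T = R + F, the sum named in the Figure 1 caption."""
    return realism(x, mu, prec) + faithfulness(x, requested, centroids)
```

In this sketch the two terms live on different scales, so a practical version would likely need to normalise them before summing; how the paper handles this is not stated here.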

If this is right

The score enables effective filtering, ranking, and abstention of generations in the extrapolative regime.
It applies directly to off-the-shelf pretrained conditional models without retraining.
In biological imaging, selected samples preserve real morphological structure better than unselected ones.
Downstream predictive performance improves when using samples chosen by the score.
The score supports abstention decisions before full decoding during the generation process.
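Given per-sample trust scores, the three decisions listed above reduce to simple array operations. The sketch below is illustrative only: the function names are hypothetical, and the abstention rule (reject anything scoring below a low percentile of the scores of real training samples) is an assumed thresholding scheme, not necessarily the one used in the paper.

```python
import numpy as np

def rank_by_trust(scores):
    """Ranking: indices sorted from most to least trusted."""
    return np.argsort(-np.asarray(scores, dtype=float), kind="stable")

def keep_top_fraction(scores, fraction):
    """Filtering: keep the top `fraction` of candidates (at least one)."""
    k = max(1, int(round(fraction * len(scores))))
    return rank_by_trust(scores)[:k]

def abstain_mask(scores, real_scores, percentile=5.0):
    """Abstention: True where a sample scores below a low percentile
    of the trust scores of real training samples (assumed rule)."""
    threshold = np.percentile(real_scores, percentile)
    return np.asarray(scores, dtype=float) < threshold
```

For example, `keep_top_fraction(scores, 0.5)` returns the indices of the better half of a candidate batch, while `abstain_mask` flags individual samples regardless of batch size.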

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same score construction could be tested on generative models for text or audio where novel attribute combinations also arise.
The coverage condition points to a natural experiment: measure how ranking quality degrades as attribute coverage in the training set is deliberately thinned.
The score might be inserted into the sampling loop itself to steer generation toward higher-trust outputs without changing the underlying model.
It offers a concrete way to quantify the gap between in-distribution and out-of-composition performance that could be compared across different conditioning mechanisms.
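The coverage-thinning experiment suggested above could start from a helper like the following, which subsamples the training rows that carry one attribute value while leaving all other rows untouched. This is a hypothetical sketch: the function name, the tuple encoding of conditions, and the uniform subsampling are all assumptions made for illustration.

```python
import numpy as np

def thin_attribute(conditions, attr_index, value, keep_fraction, seed=0):
    """Return sorted indices of a training subset in which rows whose
    attribute `attr_index` equals `value` are randomly subsampled."""
    rng = np.random.default_rng(seed)
    hit = np.array([c[attr_index] == value for c in conditions])
    hit_idx = np.flatnonzero(hit)
    n_keep = int(round(keep_fraction * len(hit_idx)))
    kept_hits = rng.choice(hit_idx, size=n_keep, replace=False)
    # Rows not carrying the thinned value are always kept.
    return np.sort(np.concatenate([np.flatnonzero(~hit), kept_hits]))
```

Refitting the score's reference statistics on successively thinner subsets and re-measuring ranking quality on the same generated samples would trace out the degradation curve.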

Load-bearing premise

The mild coverage condition on the observed attributes is needed for the score to produce meaningful comparisons in the extrapolative regime.
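The formal statement of the coverage condition is not given on this page. One plausible reading, assumed here purely for illustration, is that every attribute value in the requested condition has been observed in training at least some minimum number of times, even though the joint combination has not. The helper names and the `min_count` parameter are hypothetical.

```python
from collections import Counter

def marginal_coverage(train_conditions, requested, min_count=1):
    """Per attribute: was the requested value observed at least `min_count` times?"""
    counts = [Counter(c[i] for c in train_conditions) for i in range(len(requested))]
    return [counts[i][v] >= min_count for i, v in enumerate(requested)]

def is_unseen_composition(train_conditions, requested):
    """True if the exact joint condition never appears in the training set."""
    return tuple(requested) not in {tuple(c) for c in train_conditions}
```

A request that is an unseen composition with full marginal coverage is the regime the score targets; a request with a `False` entry would fall outside this assumed version of the condition.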

What would settle it

A direct comparison in which real samples from the new attribute compositions become available and the score's ranking of generated samples disagrees with the ranking obtained from those real samples.
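Such a comparison could be summarised by a rank correlation between the trust scores and a reference quality measured against the newly available real samples. The sketch below implements a Spearman correlation with NumPy only; it assumes no tied values, and the choice of reference quality measure is left open because this page does not specify one.

```python
import numpy as np

def ranks(values):
    """Rank positions (0 = smallest); assumes no tied values."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def spearman(a, b):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])
```

A correlation near 1 would support the score's ranking; a value near 0 or below would be the disagreement described above.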

Figures

Figures reproduced from arXiv: 2606.09601 by Berker Demirel, Francesco Locatello, Marco Fumero, Theofanis Karaletsos, Valentino Maiorca.

**Figure 1.** Pipeline for trust scoring under compositional shift. A conditional diffusion model Gθ is queried with an unseen joint condition a⋆ and produces candidate samples x̂. Features from Φ are used to compute a realism term R, which measures proximity to the real training distribution, and a faithfulness term F, which measures alignment with the requested attribute values. Their sum T = R + F provides a calibra…

**Figure 2.** CelebA decile binning (REPA-DINOv3 held-out, DINOv3 scoring). ∆KID increases monotonically from bin 0 (best trust) to bin 9 (worst) (left) and correlates with drops in downstream classification accuracy (right). Binning results with the faithfulness and realism components, together with RxRx1 DINOv3 decile curves, are reported in Section K.

**Figure 3.** Main RxRx1 CellProfiler validation (REPA-SigLIP marginal, SigLIP trust scoring). CP-space downstream classification by trust decile. Left: 4-way cell-type accuracy. Right: 50-way condition accuracy. Trust-ranked deciles show a clear correlation with the classification performance, showing that trust ordering improves utility in an interpretable morphology space independent of the DINOv3 validation encoder.…

**Figure 4.** Translator scoring across denoising (CelebA, Vanilla SiT-B/2, 250 steps). P95-real-threshold ∆KID improvement rises as the predicted-clean trajectory settles

**Figure 5.** Figure 5 contrasts post-generation and during-generation trust scoring. Post-generation scoring first completes denoising, decodes the final latent into an image, and applies the feature extractor Φ before evaluating the trust score. During-generation scoring instead maps an intermediate diffusion representation ht into the same Φ-compatible space using a learned translator T, bypassing decoding and feature extract…

**Figure 6.** Full CP-space decile downstream classification on RxRx1 (kept-621 features, SigLIP trust scoring). Each row is one marginal model; left: 4-way cell-type accuracy by decile, right: 50-way combo accuracy. Solid blue: trust-ranked decile. The REPA-SigLIP row is highlighted in the main text. For the uninformative-feature sanity check we use a less aggressive variant (unfiltered), which keeps all 2415 columns t…

**Figure 7.** Figure 7 shows the answer. The five seen (single-attribute) conditions cluster at low ∆KID and low trust; unseen conditions spread outward as the Hamming distance from support grows. Crucially, this is not a binary seen-vs.-unseen effect: some unseen conditions at small Hamming distance achieve quality comparable to seen ones, and the score correctly assigns them correspondingly favorable trust. Conversely, larger-…

**Figure 8.** Trust / realism / faithfulness decomposition for the main-text

**Figure 9.** RxRx1 DINOv3 decile binning (REPA-DINOv3 held-out, DINOv3 scoring, 50-condition subset). These learned-encoder trends support the same ordering story as CelebA, but the main text emphasizes the CellProfiler morphology validation because it is independent of the learned DINOv3 validation encoder.

**Figure 10.** During-generation decile binning after translation. The translator features recover the monotonic trust trend without decoding the sample and re-encoding it through the feature extractor.

**Figure 11.** RxRx1 timestep ablation (Vanilla SiT-B/2, 250-step sampler, translator features). Red squares: P95-real ∆KID% of the FPR95-accepted subset against a condition-matched random subset, evaluated on intermediate predicted-clean latents x̂0(k) projected through the translator. Blue circles: per-step L2 change in the VAE-decoded x̂0. Dashed red line: post-generation DINOv3 oracle (+44.0%). Compared to …
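Several of the figures above report a quantity per trust decile. A minimal sketch of that binning step follows, with bin 0 holding the highest-trust samples and bin 9 the lowest, matching the convention in the Figure 2 caption. The function names are hypothetical, and the per-bin statistic here is a plain mean standing in for whatever metric (for example ∆KID or classifier accuracy) is evaluated on each bin.

```python
import numpy as np

def decile_bins(trust_scores, n_bins=10):
    """Assign each sample a bin index: 0 = highest trust, n_bins - 1 = lowest."""
    order = np.argsort(-np.asarray(trust_scores, dtype=float), kind="stable")
    bins = np.empty(len(order), dtype=int)
    # array_split yields near-equal chunks even when n is not divisible by n_bins.
    for b, idx in enumerate(np.array_split(order, n_bins)):
        bins[idx] = b
    return bins

def per_bin_mean(values, bins, n_bins=10):
    """Mean of a per-sample quantity within each trust bin."""
    values = np.asarray(values, dtype=float)
    return np.array([values[bins == b].mean() for b in range(n_bins)])
```

A monotone trend in the per-bin statistic across bins 0 through 9 is the pattern the captions describe.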

Original abstract

Conditional generators provide a natural tool for controllable generation, including settings where the desired condition is a new composition of observed attributes or experimental factors. In many applications, especially in scientific domains, such models are attractive to explore conditions for which real samples are rare, expensive, or not yet observed. However, this creates a circularity for evaluation: standard conditional quality metrics require a reference target distribution, but in the extrapolative regime that distribution is unavailable by definition. We address this problem with a post-hoc, per-sample trust score for assessing conditional samples using only the training distribution. The score combines two estimable quantities: global realism, measuring compatibility with the real data manifold, and attribute-wise faithfulness, measuring whether a sample is closer to the requested attributes than to plausible alternatives. We show that the score can recover meaningful comparisons across extrapolated generations, under a mild coverage condition on the observed attributes. These comparisons enable effective filtering, ranking, and abstention of generations and can be used directly on off-the-shelf pretrained models. In biological imaging, selected samples preserve real morphological structure better and improve downstream predictive performance, while similar gains are observed on controlled vision benchmarks. Finally, we show how the score can be applied during generation, enabling abstention before full decoding. Code is available at https://github.com/berkerdemirel/faithful-cond-gen.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a trust score for conditional samples under compositional shift using only training data, but the supporting 'mild coverage condition' looks under-specified.

This paper's main contribution is a post-hoc trust score for evaluating individual samples from conditional generators in the compositional extrapolation setting. You score a generated sample by how well it matches the real data manifold overall and how much closer it is to the requested attributes than to alternatives, all estimated from the training distribution alone.

What stands out is the focus on scientific use cases where the target distribution for new attribute combinations is unavailable by definition. They demonstrate that this score can guide filtering and abstention, leading to better morphological preservation in biological imaging and improved predictive performance on vision tasks. The fact that it works on off-the-shelf models and can be applied during generation is practical.

The potential issue is the reliance on a mild coverage condition for the score to give reliable comparisons in the extrapolative regime. The abstract mentions it but provides no explicit definition or proof sketch here, so it's hard to judge if it's mild enough to be useful or if it actually controls the bias in the faithfulness term for real shifts. The empirical results would need to test this carefully across different types of compositional changes.

Overall, this is aimed at researchers working on conditional generative models for data-scarce or exploratory scientific applications. Anyone evaluating or deploying such models in biology or similar fields could find the ranking and abstention tools helpful.

I would send it for peer review. The core idea addresses a real problem, and the experiments suggest it has traction, though the condition needs more attention in revision.

Referee Report

2 major / 2 minor

Summary. The paper proposes a post-hoc per-sample trust score for conditional generators in compositional extrapolation settings (where target distributions are unavailable). The score combines global realism (compatibility with the training data manifold) and attribute-wise faithfulness (relative closeness to requested attributes versus alternatives), both estimated solely from the training distribution. It claims that under a mild coverage condition on observed attributes, the score recovers meaningful comparisons that support filtering, ranking, and abstention of generations; this is demonstrated on biological imaging (preserving morphology and improving downstream prediction) and controlled vision benchmarks, and works with off-the-shelf pretrained models. Code is provided.

Significance. If the coverage condition holds with the claimed sufficiency, the approach offers a practical, reference-free method for assessing extrapolated conditional samples in scientific domains where real data for new compositions is scarce. The combination of two estimable quantities and direct applicability to pretrained models is a strength; code availability supports reproducibility.

major comments (2)

[Theoretical section defining the coverage condition and score properties] The central claim that the score 'recovers meaningful comparisons across extrapolated generations' rests on the mild coverage condition (invoked in the abstract and theoretical development). However, the precise requirements of this condition and its sufficiency to control bias in the attribute-wise faithfulness term (when target compositions are unseen) are not rigorously established; without explicit bounds or a proof that training-distribution distances still induce correct ranking under compositional shift, the justification for filtering/ranking/abstention is load-bearing and under-supported.
[Experimental results on biological imaging] In the biological imaging experiments, the reported gains in morphological structure preservation and downstream predictive performance are presented as evidence of the score's utility, but the paper does not include controls or ablations testing performance when the coverage condition is mildly violated (e.g., via synthetic shifts that break attribute coverage); this leaves open whether the empirical improvements are attributable to the score or to other factors.

minor comments (2)

[Abstract] The abstract refers to the 'mild coverage condition' without a one-sentence characterization or pointer to its formal statement, which would improve accessibility.
[Method section] Notation for the two components of the score (global realism and attribute-wise faithfulness) should be introduced with explicit equations early in the method section to avoid ambiguity when discussing their combination.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments below, clarifying the role of the coverage condition and outlining planned revisions to strengthen both the theoretical and experimental sections.

Referee: [Theoretical section defining the coverage condition and score properties] The central claim that the score 'recovers meaningful comparisons across extrapolated generations' rests on the mild coverage condition (invoked in the abstract and theoretical development). However, the precise requirements of this condition and its sufficiency to control bias in the attribute-wise faithfulness term (when target compositions are unseen) are not rigorously established; without explicit bounds or a proof that training-distribution distances still induce correct ranking under compositional shift, the justification for filtering/ranking/abstention is load-bearing and under-supported.

Authors: We agree that the manuscript would benefit from a more explicit formalization of how the coverage condition ensures correct ranking under compositional shift. The condition is introduced in the theoretical development to ensure that attribute-wise distances estimated from the training distribution remain informative for unseen compositions. In revision we will add a dedicated proposition that states the precise coverage requirement (every relevant attribute appears with sufficient diversity in the observed data) together with a short proof sketch showing that the faithfulness term preserves the correct ordering in expectation; we will also include a brief discussion of the resulting bias term when coverage is only approximate. revision: yes
Referee: [Experimental results on biological imaging] In the biological imaging experiments, the reported gains in morphological structure preservation and downstream predictive performance are presented as evidence of the score's utility, but the paper does not include controls or ablations testing performance when the coverage condition is mildly violated (e.g., via synthetic shifts that break attribute coverage); this leaves open whether the empirical improvements are attributable to the score or to other factors.

Authors: We acknowledge that an explicit ablation under controlled violations of coverage would make the empirical claims more robust. The biological dataset satisfies the coverage condition by design, which is why the reported gains appear. In the revision we will add a controlled synthetic experiment on a vision benchmark in which we deliberately remove selected attribute co-occurrences to create mild coverage violations and report the resulting drop in the score's ability to rank or filter samples; this will directly link the observed improvements to the condition holding. revision: yes

Circularity Check

0 steps flagged

No significant circularity: score defined from independent training-distribution quantities under explicit assumption.

The paper defines its trust score directly from two quantities (global realism and attribute-wise faithfulness) that are estimable from the training distribution alone. The claim that this score recovers meaningful comparisons is conditioned on an explicitly invoked 'mild coverage condition on the observed attributes,' which functions as an assumption rather than a derived equality. No equations, self-citations, or fitted-parameter renamings are shown that would reduce the score or its extrapolation properties to the inputs by construction. The derivation therefore remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities explicitly identified.

2000

[1] Ahmed Alaa, Boris Van Breugel, Evgeny S Saveliev, and Mihaela Van Der Schaar. How faithful is your synthetic data? sample-level metrics for evaluating and auditing generative models. In International conference on machine learning, pages 290–306. PMLR, 2022.

[2] Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mohammad Norouzi, and David J Fleet. Synthetic data from diffusion models improves ImageNet classification. Transactions on Machine Learning Research, 2023.

[3] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.

[4] Anne E Carpenter, Thouis R Jones, Michael R Lamprecht, Colin Clarke, In Han Kang, Ola Friman, David A Guertin, Joo Han Chang, Robert A Lindquist, Jason Moffat, et al. Cellprofiler: image analysis software for identifying and quantifying cell phenotypes. Genome biology, 7:R100, 2006.

[5] Berker Demirel, Marco Fumero, Theofanis Karaletsos, and Francesco Locatello. Morphgen: Controllable and morphologically plausible generative cell-imaging. arXiv preprint arXiv:2510.01298, 2025.

[6] Berker Demirel, Marco Fumero, and Francesco Locatello. Out-of-distribution detection with relative angles. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[7] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, 2021.

[8] Alessandro Favero, Antonio Sclocchi, Francesco Cagnetta, Pascal Frossard, and Matthieu Wyart. How compositional generalization and creativity improve as diffusion models are trained. arXiv preprint arXiv:2502.12089, 2025.

[9] Sachit Gaudi, Gautam Sreekumar, and Vishnu Boddeti. Coind: Enabling logical compositions in diffusion models. In The Thirteenth International Conference on Learning Representations, 2025.

[10] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic data from generative models ready for image recognition? In International Conference on Learning Representations, 2023.

[11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.

[12] Oren Kraus, Kian Kenyon-Dean, Saber Saberian, Maryam Fallah, Peter McLean, Jess Leung, Vasudev Sharma, Ayla Khan, Jia Balakrishnan, Safiye Celik, et al. Masked autoencoders for microscopy are scalable learners of cellular biology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11757–11768, 2024.

[13] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32, 2019.

[14] Olivier Ledoit and Michael Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, 2004.

[15] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems, 31, 2018.

[16] Litian Liu and Yao Qin. Fast decision boundary based out-of-distribution detector. In Forty-first International Conference on Machine Learning, 2024.

[17] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in neural information processing systems, 33:21464–21475, 2020.

[18] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In International Conference on Computer Vision, 2015.

[19] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024.

[20] Maximilian Müller and Matthias Hein. Mahalanobis++: Improving OOD detection via feature normalization. In Forty-second International Conference on Machine Learning, 2025.

[21] Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, 2020.

[22] Zeinab Navidi, Jun Ma, Esteban Miglietta, Le Liu, Anne E Carpenter, Beth A Cimini, Benjamin Haibe-Kains, and Bo Wang. Morphodiff: Cellular morphology painting with diffusion models. In The Thirteenth International Conference on Learning Representations, 2025.

[23] Maya Okawa, Ekdeep Singh Lubana, Robert P. Dick, and Hidenori Tanaka. Compositional abilities emerge multiplicatively: Exploring diffusion models on a synthetic task. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[24] Core F Park, Maya Okawa, Andrew Lee, Hidenori Tanaka, and Ekdeep S Lubana. Emergence of hidden capabilities: Exploring learning dynamics in concept space. Advances in Neural Information Processing Systems, 37:84698–84729, 2024.

[25] Dogyun Park and Suhyun Kim. Probabilistic precision and recall towards reliable evaluation of generative models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 20099–20109, 2023.

[26] Jaewoo Park, Yoon Gyo Jung, and Andrew Beng Jin Teoh. Nearest neighbor guidance for out-of-distribution detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1686–1695, 2023.

[27] Vasco Ramos, Regev Cohen, Idan Szpektor, and Joao Magalhaes. Beyond the noise: Aligning prompts with latent representations in diffusion models. arXiv preprint arXiv:2512.08505, 2025.

[28] Jie Ren, Stanislav Fort, Jeremiah Liu, Abhijit Guha Roy, Shreyas Padhy, and Balaji Lakshminarayanan. A simple fix to mahalanobis distance for improving near-ood detection. arXiv preprint arXiv:2106.09022, 2021.

[29] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

[30] Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. Advances in neural information processing systems, 31, 2018.

[31] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025.

[32] Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. In International conference on machine learning, pages 20827–20840. PMLR, 2022.

[33] Maciej Sypetkowski, Morteza Rezanejad, Saber Saberian, Oren Kraus, John Urbanik, James Taylor, Ben Mabey, Mason Victors, Jason Yosinski, Alborz Rezazadeh Sereshkeh, et al. Rxrx1: A dataset for evaluating experimental batch correction methods. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4285–4294, 2023.

[34] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In The Thirteenth International Conference on Learning Representations, 2025.

[35] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023.

A Model Pipeline

Figure 5 contrasts post-generation and during-generation trust scoring. Post-generation scoring first completes denoising, ...

Filler rows. 13 of the 25 seen conditions (all ct=0 and ct=2 non-controls) had only n=14 real samples each, too few for a stable per-condition KID bootstrap. Under a class-balanced probe on the 50-class subproblem, these rows collapse to top-1 ≈ 0: the probe cannot distinguish them from each other or from the rest of the catalog. They contribute no usabl...

No unseen functional diversity. The unseen part of the test set contained only control wells, even though several non-control held-out perturbations were available in the held-out set of the diffusion training. The support-shift test therefore covered a single class of perturbations rather than the diversity the held-out set was designed to provide.

Imbalance-probe confound. A class-imbalanced probe resulted in a class-frequency artifact: controls dominate the per-class sample counts. Under a class-balanced probe restricted to the 50-class subproblem, the controls collapse and the genuinely discriminable rows are the non-control siRNAs together with a smaller subset of controls. Any subset that does n...