Early Estimation of Language to Latent Alignment in Diffusion Models

2025 · cs.CV · arXiv 2512.08505

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Conditional diffusion models frequently suffer from language-image misalignment. Because intermediate noise-corrupted latents are ambiguous, assessing prompt adherence currently requires completing the entire sampling trajectory. This late-stage evaluation is even more costly under test-time scaling strategies such as Best-of-N (BoN) sampling, where every misaligned trajectory must finish generation before it can be discarded. To tackle this, we propose NoisyCLIP, a noise-aware twin-tower model that enables early language-to-latent alignment estimation. By learning a vision encoder on noise-corrupted latents, we allow the model to "see" through the ambiguity of intermediate diffusion steps. To facilitate this training, we investigate noise-data augmentation sampling strategies and introduce two new benchmark datasets: Noisy-Conceptual-Captions and Noisy-GenAI-Bench. When applied as an early-stopping criterion for BoN, NoisyCLIP at half cost matches or beats frozen CLIP at full cost. Ultimately, this transforms alignment assessment from an expensive final check into a continuous monitoring tool, drastically reducing compute costs without sacrificing semantic fidelity.
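The early-stopping idea for BoN described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give NoisyCLIP's interface or its exact pruning rule, so `score_fn` is a hypothetical stand-in for a noise-aware alignment scorer, `step_fn` is a hypothetical stand-in for one denoising step, latents are reduced to single floats, and candidates are pruned by a simple top-k cut at one checkpoint.

```python
import random
from typing import Callable, Tuple


def best_of_n_early_stop(
    n_candidates: int,
    total_steps: int,
    check_step: int,
    keep: int,
    step_fn: Callable[[float, int], float],
    score_fn: Callable[[float, int], float],
    seed: int = 0,
) -> Tuple[float, int]:
    """Run N trajectories, prune at check_step, finish only the survivors.

    Returns the best final latent and the number of denoising steps spent.
    All names are hypothetical; this is not the paper's implementation.
    """
    rng = random.Random(seed)
    # Toy "latents": one float per candidate, drawn from a standard normal.
    latents = [rng.gauss(0.0, 1.0) for _ in range(n_candidates)]
    steps_spent = 0

    # Phase 1: advance every candidate up to the checkpoint.
    for t in range(check_step):
        latents = [step_fn(z, t) for z in latents]
        steps_spent += len(latents)

    # Early alignment estimate on the still-noisy latents; keep the top-k.
    latents = sorted(
        latents, key=lambda z: score_fn(z, check_step), reverse=True
    )[:keep]

    # Phase 2: only the surviving candidates pay for the remaining steps.
    for t in range(check_step, total_steps):
        latents = [step_fn(z, t) for z in latents]
        steps_spent += len(latents)

    best = max(latents, key=lambda z: score_fn(z, total_steps))
    return best, steps_spent


def toy_step(z: float, t: int) -> float:
    """Hypothetical denoising step: shrink the latent toward zero."""
    return 0.9 * z


def toy_score(z: float, t: int) -> float:
    """Hypothetical alignment score: closer to 1.0 counts as better aligned."""
    return -abs(z - 1.0)


if __name__ == "__main__":
    best, spent = best_of_n_early_stop(8, 10, 5, 2, toy_step, toy_score)
    print(f"denoising steps spent: {spent} (full BoN would spend {8 * 10})")
```

With 8 candidates, 10 steps, a checkpoint at step 5, and 2 survivors, the sketch spends 8·5 + 2·5 = 50 denoising steps instead of the 80 that full BoN would need. The saving depends entirely on how reliable the intermediate score is, which is the gap a noise-aware scorer is meant to close.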

representative citing papers

Assessing Sample Quality in Conditional Generation under Compositional Shift

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

Introduces a per-sample trust score combining global realism and attribute-wise faithfulness, estimable from training data alone, for assessing conditional generations under compositional shift.

citing papers explorer

Showing 1 of 1 citing paper.

Assessing Sample Quality in Conditional Generation under Compositional Shift
cs.LG · 2026-06-08 · unverdicted · none · ref 27 · internal anchor
Introduces a per-sample trust score combining global realism and attribute-wise faithfulness, estimable from training data alone, for assessing conditional generations under compositional shift.
