Early Estimation of Language to Latent Alignment in Diffusion Models
Conditional diffusion models frequently suffer from language-image misalignment. Because intermediate noise-corrupted latents are ambiguous, assessing prompt adherence currently requires completing the entire sampling trajectory. This late-stage evaluation is even more costly under test-time scaling strategies such as Best-of-N (BoN) sampling, since every misaligned trajectory must finish generation before it can be discarded. To tackle this, we propose NoisyCLIP, a noise-aware twin-tower model that enables early language-to-latent alignment estimation. By training a vision encoder on noise-corrupted latents, we allow the model to "see" through the ambiguity of intermediate diffusion steps. To facilitate this training, we investigate noise-data augmentation sampling strategies and introduce two new benchmark datasets: Noisy-Conceptual-Captions and Noisy-GenAI-Bench. When applied as an early-stopping criterion for BoN, NoisyCLIP at half the cost matches or beats frozen CLIP at full cost. Ultimately, this transforms alignment assessment from an expensive final check into a continuous monitoring tool, drastically reducing compute costs without sacrificing semantic fidelity.
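The early-stopping idea for BoN can be illustrated with a minimal sketch. The abstract does not specify the pruning schedule, so everything here is an assumption for illustration: the function name `early_stop_best_of_n`, the single checkpoint `check_step`, the `keep` parameter, and the `step_fn` and `score_fn` callables (stand-ins for one denoising step and a noise-aware alignment scorer) are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Tuple, TypeVar

Latent = TypeVar("Latent")


def early_stop_best_of_n(
    init_latents: List[Latent],
    step_fn: Callable[[Latent, int], Latent],   # hypothetical: one denoising step
    score_fn: Callable[[Latent, int], float],   # hypothetical: alignment score at step t
    total_steps: int,
    check_step: int,
    keep: int,
) -> Tuple[Latent, int]:
    """Return (best final latent, number of denoising steps spent)."""
    if not 0 < check_step <= total_steps:
        raise ValueError("check_step must lie in (0, total_steps]")
    if not 0 < keep <= len(init_latents):
        raise ValueError("keep must lie in (0, N]")

    steps_spent = 0

    # Phase 1: advance all N candidates to the checkpoint.
    candidates = list(init_latents)
    for t in range(check_step):
        candidates = [step_fn(z, t) for z in candidates]
        steps_spent += len(candidates)

    # Phase 2: score the still-noisy latents and discard the weakest.
    ranked = sorted(candidates, key=lambda z: score_fn(z, check_step), reverse=True)
    survivors = ranked[:keep]

    # Phase 3: finish sampling only for the survivors.
    for t in range(check_step, total_steps):
        survivors = [step_fn(z, t) for z in survivors]
        steps_spent += len(survivors)

    best = max(survivors, key=lambda z: score_fn(z, total_steps))
    return best, steps_spent


# Toy stand-ins: latents are floats, each "denoising step" adds 1,
# and the "alignment score" is the value itself.
best, spent = early_stop_best_of_n(
    init_latents=[0.0, 3.0, 1.0, 2.0],
    step_fn=lambda z, t: z + 1.0,
    score_fn=lambda z, t: z,
    total_steps=4,
    check_step=2,
    keep=1,
)
full_cost = 4 * 4  # plain BoN: N candidates times total_steps
```

In this toy run the pruned search spends 10 denoising steps where plain BoN would spend 16. The saving only pays off if the intermediate scores rank candidates reliably, which is the role the abstract assigns to a noise-aware scorer.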
Forward citations
Cited by 1 Pith paper
Assessing Sample Quality in Conditional Generation under Compositional Shift
Introduces a per-sample trust score combining global realism and attribute-wise faithfulness, estimable from training data alone, for assessing conditional generations under compositional shift.