Early Estimation of Language to Latent Alignment in Diffusion Models

Idan Szpektor; Joao Magalhaes; Regev Cohen; Vasco Ramos

arxiv: 2512.08505 · v2 · pith:XUAUYIG6new · submitted 2025-12-09 · 💻 cs.CV

keywords: alignment, diffusion, sampling, ambiguity, cost, costs, early, estimation

Conditional diffusion models frequently suffer from language-image misalignment. Because intermediate noise-corrupted latents are ambiguous, assessing prompt adherence currently requires completing the entire sampling trajectory. This late-stage evaluation is even more costly under test-time scaling strategies such as Best-of-N (BoN) sampling, because every misaligned trajectory must finish generation before it can be discarded. To tackle this, we propose NoisyCLIP, a noise-aware twin-tower model that enables early language-to-latent alignment estimation. By training a vision encoder on noise-corrupted latents, we allow the model to "see" through the ambiguity of intermediate diffusion steps. To facilitate this training, we investigate noise-data augmentation sampling strategies and introduce two new benchmark datasets: Noisy-Conceptual-Captions and Noisy-GenAI-Bench. When applied as an early-stopping criterion for BoN, NoisyCLIP at half the cost matches or beats frozen CLIP at full cost. Ultimately, this transforms alignment assessment from an expensive final check into a continuous monitoring tool, drastically reducing compute costs without sacrificing semantic fidelity.
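The early-stopping idea for BoN described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact stopping rule, so the version below assumes one simple variant: run all N candidates up to a single checkpoint, keep the top `keep` by an alignment score, and finish only those. The names `best_of_n_early_stop`, `denoise_step`, `score`, `check_step`, and `keep` are hypothetical; `score` stands in for a noise-aware estimator such as NoisyCLIP, and the toy latents are plain floats rather than real diffusion latents.

```python
from typing import Callable, List, Tuple, TypeVar

Latent = TypeVar("Latent")


def best_of_n_early_stop(
    latents: List[Latent],
    denoise_step: Callable[[Latent, int], Latent],
    score: Callable[[Latent, int], float],
    total_steps: int,
    check_step: int,
    keep: int,
) -> Tuple[Latent, int]:
    """Return (best final latent, number of denoising steps spent).

    Assumed pruning rule (not taken from the paper): rank all candidates
    once at `check_step` and finish only the top `keep`.
    """
    if not latents:
        raise ValueError("need at least one candidate")
    if not 0 < check_step <= total_steps:
        raise ValueError("check_step must be in (0, total_steps]")
    if not 1 <= keep <= len(latents):
        raise ValueError("keep must be in [1, len(latents)]")

    steps_spent = 0

    # Phase 1: advance every candidate to the checkpoint.
    partial = []
    for z in latents:
        for t in range(check_step):
            z = denoise_step(z, t)
            steps_spent += 1
        partial.append(z)

    # Phase 2: estimate alignment on the still-noisy latents and prune.
    ranked = sorted(partial, key=lambda z: score(z, check_step), reverse=True)
    survivors = ranked[:keep]

    # Phase 3: finish only the surviving trajectories.
    finished = []
    for z in survivors:
        for t in range(check_step, total_steps):
            z = denoise_step(z, t)
            steps_spent += 1
        finished.append(z)

    best = max(finished, key=lambda z: score(z, total_steps))
    return best, steps_spent


# Toy usage: identity "denoiser" and a score equal to the latent value.
toy_latents = [0.1, 0.9, 0.4, 0.7]
best, cost = best_of_n_early_stop(
    toy_latents,
    denoise_step=lambda z, t: z,
    score=lambda z, t: z,
    total_steps=10,
    check_step=5,
    keep=1,
)
full_cost = len(toy_latents) * 10  # every trajectory run to completion
print(best, cost, full_cost)
```

In this toy setting the pruned run spends N * check_step + keep * (total_steps - check_step) denoising steps instead of N * total_steps. The real savings, and whether selection quality holds up, depend on how reliable the intermediate-step score is, which is the question the paper addresses.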

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Assessing Sample Quality in Conditional Generation under Compositional Shift
cs.LG 2026-06 unverdicted novelty 7.0

Introduces a per-sample trust score combining global realism and attribute-wise faithfulness, estimable from training data alone, for assessing conditional generations under compositional shift.