CVG improves compositional faithfulness in frozen text-to-video diffusion models by steering early denoising steps with gradients from a classifier trained on the model's own cross-attention features.
In: Advances in Neural Information Processing Systems
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Compositional Video Generation via Inference-Time Guidance