AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation

Haoze Zheng; Harry Yang; Jingjin Zhu; Manyuan Zhang; Ser-Nam Lim; Wen-Jie Shu; Yexin Liu; Yueze Wang; Zile Huang

arxiv: 2512.01334 · v2 · pith:REABV55Unew · submitted 2025-12-01 · 💻 cs.CV

AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation

Yexin Liu , Wen-Jie Shu , Zile Huang , Haoze Zheng , Yueze Wang , Jingjin Zhu , Manyuan Zhang , Ser-Nam Lim

show 1 more author

Harry Yang

This is my paper

classification 💻 cs.CV

keywords attentionalignvidtextbfsemanticgenerationadditionbenchmarkfidelity

0 comments

read the original abstract

Text-guided image-to-video generation has made substantial progress, yet it still struggles to execute text-specified edits that require substantial changes to a reference image (\textit{e.g., object addition, removal, or modification}). Empirically, our analysis reveals that this stems from \textbf{visual dominance}, where the reference image causes severe attention dispersion, inhibiting the model's ability to incorporate new semantic information. To address this, we propose \textbf{AlignVid}, a training-free intervention that re-calibrates the model's internal attention distribution. Drawing on an energy-based perspective of attention, AlignVid employs Attention Scaling Modulation (\textbf{ASM}) to reduce attention entropy and concentrate focus on semantic tokens, alongside Guidance Scheduling (\textbf{GS}) to maintain generation stability. To rigorously assess this capability, we present \textbf{OmitI2V}, a comprehensive benchmark for evaluating prompt adherence across object modification, addition, and deletion. Extensive experiments demonstrate that AlignVid effectively enhances semantic fidelity with negligible computational overhead. Code and the OmitI2V benchmark are available at https://github.com/LAW1223/AlignVid.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models
cs.CV 2026-05 unverdicted novelty 7.0

ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.