arxiv: 2512.01334 · v2 · pith:REABV55Unew · submitted 2025-12-01 · 💻 cs.CV

AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation

classification 💻 cs.CV
keywords attention · alignvid · semantic · generation · addition · benchmark · fidelity

Text-guided image-to-video generation has made substantial progress, yet it still struggles to execute text-specified edits that require substantial changes to a reference image (e.g., object addition, removal, or modification). Empirically, our analysis reveals that this stems from visual dominance, where the reference image causes severe attention dispersion, inhibiting the model's ability to incorporate new semantic information. To address this, we propose AlignVid, a training-free intervention that re-calibrates the model's internal attention distribution. Drawing on an energy-based perspective of attention, AlignVid employs Attention Scaling Modulation (ASM) to reduce attention entropy and concentrate focus on semantic tokens, alongside Guidance Scheduling (GS) to maintain generation stability. To rigorously assess this capability, we present OmitI2V, a comprehensive benchmark for evaluating prompt adherence across object modification, addition, and deletion. Extensive experiments demonstrate that AlignVid effectively enhances semantic fidelity with negligible computational overhead. Code and the OmitI2V benchmark are available at https://github.com/LAW1223/AlignVid.
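
The abstract states that scaling attention can reduce attention entropy but does not spell out the mechanism. The sketch below illustrates only the general principle that multiplying attention logits by a factor greater than 1 before the softmax sharpens the resulting distribution and lowers its entropy. It is not the authors' ASM implementation: the function names, the `scale` parameter, and the example logits are assumptions made for illustration.

```python
import math

def softmax(logits, scale=1.0):
    # `scale` is a hypothetical multiplier applied to the logits before softmax.
    scaled = [scale * x for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Made-up attention logits for one query over four tokens.
logits = [2.0, 1.0, 0.5, 0.1]

base = softmax(logits)              # unscaled attention weights
sharp = softmax(logits, scale=2.0)  # scaled logits give a more peaked distribution

print("entropy (scale=1.0):", entropy(base))
print("entropy (scale=2.0):", entropy(sharp))
```

With any non-uniform logits, the scaled distribution puts more weight on the highest-scoring token, which is the sense in which scaling can "concentrate focus". How AlignVid chooses the scale, and which layers or tokens it applies to, is described in the paper and the linked repository, not here.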

This paper has not been read by Pith yet.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.