pith. sign in

arxiv: 2607.01499 · v1 · pith:WJAUS52Pnew · submitted 2026-07-01 · 💻 cs.CV

Anti-Prompt: Image Protection against Text-Guided Image-to-Video Generation

Pith reviewed 2026-07-03 20:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords image protectionadversarial perturbationimage-to-video generationtext guidancedenoising processcopyright protectiontemporal consistency
0
0 comments X

The pith

Anti-Prompt adds imperceptible perturbations to images that cause text-guided image-to-video models to produce inconsistent and structurally failed videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a protection method for images against unauthorized animation into videos under text prompts. It starts from the observation that removing text guidance from current I2V models sharply degrades motion, subject preservation, and temporal coherence. Anti-Prompt therefore injects small changes into the source image to reduce the model's use of text signals during denoising steps and favor visual-only processing instead. If successful, this yields a practical defense that works across model architectures without requiring access to model weights. The work also supplies a Video-LLM evaluation protocol that scores frame-level artifacts to measure protection strength.

Core claim

By attenuating text-conditioned interactions while strengthening visual-only pathways in the denoising process, targeted image perturbations reliably induce visible inconsistencies, structural failures, and loss of temporal coherence in the output videos of text-guided I2V models.

What carries the argument

Imperceptible perturbations that weaken text-conditioned interactions during denoising and reinforce visual-only pathways.

If this is right

  • Protection performance reaches strong levels on two representative I2V architectures.
  • The approach improves computational efficiency over earlier protection techniques.
  • Perturbations transfer effectively to unseen I2V models.
  • A Video-LLM-assisted protocol supplies interpretable, frame-grounded metrics for protection quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models retrained to reduce reliance on text prompts might become harder to protect with this style of attack.
  • The same perturbation principle could be tested on other text-conditioned generation tasks such as image editing or 3D synthesis.
  • Widespread use would let image owners apply a one-time safeguard before uploading content online.

Load-bearing premise

Modern I2V models depend so strongly on textual guidance that small image changes can selectively disrupt that dependence and produce visible generation failures without model-specific knowledge.

What would settle it

Generate videos from protected images on the tested I2V models and check whether the expected inconsistencies, structural breaks, and temporal failures appear at rates clearly above those from unprotected images.

Figures

Figures reproduced from arXiv: 2607.01499 by Chanhui Lee, Jeany Son, Jinsoo Park, Yeonghwan Song.

Figure 1
Figure 1. Figure 1: Text-guided I2V generation and protection. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Empirical observation: prompt removal degrades I2V generation [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of Anti-Prompt. (a) We optimize an imperceptible perturbation to suppress text-dependent pathways and promote visual-only dominance. Latent dis￾tortion corresponds to the encoder attack. (b) Cross-Attention: text enters via a cross-attention residual. We suppress this text-dependent residual and strengthen the self-attention (visual-only) residual. (c) Full-Attention: video tokens attend jointly t… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results on CogVideoX [61]. We compare generations from the original image, I2VGuard [14], and Ours. 6 Experiments Implementation Details. Following I2VGuard [14], we set the PGD hyper￾parameters to K = 50, ϵ = 8/255, and α = 1/255 for all experiments. We unroll T = 10 denoising steps during optimization. Unless specified otherwise, we set λ1 = 1 and λ2 = 1 for CogVideoX [61], and use λ1 = 0.1 a… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative results on LTX-Video [17]. We compare generations from the original image, I2VGuard [14], and Ours. (Qwen3-VL-8B [2]). We follow the official VBench implementation for metric computation; our failure-oriented evaluation protocol is described in Section 5. Baselines. We compare our method with I2VGuard [14], which is designed to prevent unauthorized video generation. The original I2VGuard paper … view at source ↗
Figure 6
Figure 6. Figure 6: Failure cases on static or low-motion scenes. Our method shows reduced ef￾fectiveness when the expected video dynamics are minimal, making visible animation failures harder to induce. Despite the overall degradation caused by purification, our method remains more effective than I2VGuard on most semantic, structural, and motion-related dimensions. At the same time, we observe that I2VGuard sometimes attains… view at source ↗
Figure 7
Figure 7. Figure 7: Generation results from JPEG-purified images using CogVideoX. After purifi￾cation, I2VGuard often leaves residual high-frequency noise, whereas our method more often leads to structural collapse or abnormal motion. This illustrates that purification can weaken both methods while altering their dominant failure modes in different ways. 12 Limitation Our method has several limitations. First, although it is … view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative examples on an image editing model (FLUX.1 [28]). Although not systematically evaluated, we observe that protected images can lead to editing results that differ from clean editing results. Third, purification techniques such as JPEG compression or adversarial clean￾ing can partially weaken the protection effect ( [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Shared prompt template used for Video-LLM evaluator. Subject Preservation Definition: Identity-level consistency: whether the main subject from the first frame remains present and recognizable. Anti-contamination rule: Evaluate only subject identity and presence. Ignore rendering noise, blur, flicker, motion smoothness, boundary tearing, and geometry deformation unless they directly make the subject unreco… view at source ↗
Figure 10
Figure 10. Figure 10: Subject Preservation evaluation criteria [PITH_FULL_IMAGE:figures/full_fig_p032_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Structural Consistency evaluation criteria. Dynamic Consistency Definition: Temporal and physical coherence: whether motion and state transitions across frames are temporally stable and physically plausible. Anti-contamination rule: Evaluate only time and physics coherence, including flicker, abrupt jumps, teleport-like changes, and physical implausibility. Ignore identity replacement and pure rendering a… view at source ↗
Figure 12
Figure 12. Figure 12: Dynamic Consistency evaluation criteria. Artifact Suppression Definition: Rendering-level defects independent of identity, geometry, or temporal coherence. Anti-contamination rule: Evaluate only rendering defects such as blur, noise, grid artifacts, blockiness, and banding. Ignore identity replacement, shape deformation, and coherent motion unless the rendering artifact itself is the primary issue. Repres… view at source ↗
Figure 13
Figure 13. Figure 13: Artifact Suppression evaluation criteria [PITH_FULL_IMAGE:figures/full_fig_p033_13.png] view at source ↗
read the original abstract

Recent advances in Image-to-Video generation allow a single image to be animated into a convincing video under text guidance, raising serious copyright and privacy risks. We propose Anti-Prompt, an image protection approach that injects imperceptible perturbations into an image, inducing visible inconsistencies and structural failures in text-guided I2V generation. Our method is motivated by a simple empirical observation. When text guidance is removed from modern I2V models, generation quality degrades markedly, not only in motion realism but also in subject preservation, structural coherence, and temporal consistency. Building on this insight, Anti-Prompt exploits the model reliance on textual guidance by attenuating text-conditioned interactions during denoising while strengthening visual-only pathways. To further systematically evaluate protection effectiveness, we introduce a Video-LLM-assisted evaluation protocol that provides interpretable, frame-grounded analyses of generation artifacts and inconsistencies. Experiments on two representative I2V architectures demonstrate that our method achieves strong protection performance while improving efficiency and cross-model transferability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes Anti-Prompt, a protection method that adds imperceptible perturbations to an input image to defend against text-guided image-to-video (I2V) generation. Motivated by the observation that removing text guidance from I2V models degrades motion realism, subject preservation, structural coherence, and temporal consistency, the approach attenuates text-conditioned interactions during denoising while strengthening visual-only pathways, inducing visible artifacts and failures in the output video. It also introduces a Video-LLM-assisted evaluation protocol for frame-grounded analysis of artifacts. Experiments on two representative I2V architectures report strong protection performance along with gains in efficiency and cross-model transferability.

Significance. If the empirical results hold under the stated conditions, the work addresses a timely copyright and privacy concern raised by recent I2V models. The empirical motivation from text-guidance ablation is a clear strength, and the Video-LLM evaluation protocol offers an interpretable, reproducible way to quantify protection that goes beyond simple visual inspection. Cross-model transferability without model-specific knowledge would be a practically valuable property if demonstrated rigorously.

major comments (2)
  1. [Abstract / Introduction] The load-bearing premise that text-conditioned interactions can be selectively attenuated by image-space perturbations without access to the target I2V model (stated in the abstract and motivated in the introduction) requires explicit verification; the manuscript should include an ablation confirming that the perturbation optimization does not rely on gradients or internal activations from the protected-against model.
  2. [§4] §4 (Experiments): the claim of improved cross-model transferability is central yet reported only at high level; quantitative transfer results across the two architectures (including failure rates or artifact scores) must be presented with statistical significance and compared against a random-perturbation baseline to establish that the effect is not due to generic image degradation.
minor comments (3)
  1. [Abstract] The abstract refers to 'two representative I2V architectures' without naming them; the method and experimental sections should state the exact models (e.g., Stable Video Diffusion, AnimateDiff) at first mention.
  2. [Evaluation protocol] The Video-LLM evaluation protocol is introduced as a contribution; its prompt template, frame sampling strategy, and inter-annotator agreement metrics should be detailed in an appendix or dedicated subsection for reproducibility.
  3. [§3] Notation for the perturbation optimization objective (likely in §3) should be introduced with a clear equation rather than prose description only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's timeliness and the strengths of the empirical motivation and Video-LLM protocol. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract / Introduction] The load-bearing premise that text-conditioned interactions can be selectively attenuated by image-space perturbations without access to the target I2V model (stated in the abstract and motivated in the introduction) requires explicit verification; the manuscript should include an ablation confirming that the perturbation optimization does not rely on gradients or internal activations from the protected-against model.

    Authors: We agree that explicit verification of the black-box nature of the optimization is valuable. The method is designed to operate without access to the target I2V model, using only image-space perturbations derived from the general observation on text guidance. We will add an ablation study in the revised manuscript (likely in §3) that confirms the optimization process does not access gradients or internal activations from the protected-against models. revision: yes

  2. Referee: [§4] §4 (Experiments): the claim of improved cross-model transferability is central yet reported only at high level; quantitative transfer results across the two architectures (including failure rates or artifact scores) must be presented with statistical significance and compared against a random-perturbation baseline to establish that the effect is not due to generic image degradation.

    Authors: We acknowledge that the cross-model results are summarized at a high level in the current version. In the revision we will expand §4 with detailed quantitative transfer results across the two architectures, including failure rates and artifact scores, statistical significance testing, and direct comparisons to a random-perturbation baseline to isolate the effect from generic degradation. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical method for image protection motivated by the observation that removing text guidance degrades I2V generation quality. No equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the abstract or description. The central claim rests on experimental validation across architectures rather than any self-referential reduction or ansatz smuggled via prior work, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no extractable free parameters, axioms, or invented entities; the method description implies unstated assumptions about model internals but none are formalized.

pith-pipeline@v0.9.1-grok · 5708 in / 1259 out tokens · 19174 ms · 2026-07-03T20:55:11.016114+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

78 extracted references · 27 canonical work pages · 23 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  3. [3]

    In: SIGGRAPH Asia (2024)

    Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., et al.: Lumiere: A space-time diffusion model for video generation. In: SIGGRAPH Asia (2024)

  4. [4]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  5. [5]

    In: CVPR (2023)

    Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: CVPR (2023)

  6. [6]

    On the Opportunities and Risks of Foundation Models

    Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)

  7. [7]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., et al.: Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512 (2023)

  8. [8]

    In: CVPR (2024)

    Chen, H., Zhang, Y., Cun, X., Xia, M., Wang, X., Weng, C., Shan, Y.: Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In: CVPR (2024)

  9. [9]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al.: Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426 (2023)

  10. [10]

    In: ICML (2024)

    Choi, J.S., Lee, K., Jeong, J., Xie, S., Shin, J., Lee, K.: Diffusionguard: A robust defense against malicious diffusion-based image editing. In: ICML (2024)

  11. [11]

    Diaz,G.:Adversecleanerextension(2025),https://github.com/gogodr/AdverseCleanerExtension

  12. [12]

    IEEE Transactions on Pattern Analysis and Machine Intelligence44(5), 2567–2581 (2020)

    Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: Unify- ing structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence44(5), 2567–2581 (2020)

  13. [13]

    In: ICML (2024)

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML (2024)

  14. [14]

    In: CVPR (2025)

    Gui, D., Guo, X., Zhou, W., Lu, Y.: I2vguard: Safeguarding images against misuse in diffusion-based image-to-video models. In: CVPR (2025)

  15. [15]

    In: ICLR (2018)

    Guo, C., Rana, M., Cisse, M., van der Maaten, L.: Countering adversarial images using input transformations. In: ICLR (2018)

  16. [16]

    In: ICLR (2024)

    Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In: ICLR (2024)

  17. [17]

    LTX-Video: Realtime Video Latent Diffusion

    HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al.: Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024) Anti-Prompt 17

  18. [18]

    In: CVPR (2025)

    Han, H., Li, S., Chen, J., Yuan, Y., Wu, Y., Deng, Y., Leong, C.T., Du, H., Fu, J., Li, Y., et al.: Video-bench: Human-aligned video generation benchmark. In: CVPR (2025)

  19. [19]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    He, Y., Yang, T., Zhang, Y., Shan, Y., Chen, Q.: Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221 (2022)

  20. [20]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)

  21. [21]

    In: NeurIPS (2020)

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)

  22. [22]

    In: NeurIPS (2022)

    Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. In: NeurIPS (2022)

  23. [23]

    In: CVPR (2024)

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video genera- tive models. In: CVPR (2024)

  24. [24]

    VBench++: Comprehensive and versatile benchmark suite for video generative models.arXiv preprint arXiv:2411.13503, 2024

    Huang, Z., Zhang, F., Xu, X., He, Y., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y., et al.: Vbench++: Comprehensive and versatile benchmark suite for video generative models. arXiv preprint arXiv:2411.13503 (2024)

  25. [25]

    In: ICLR (2025)

    Jeon, J., Kim, W.J., Ha, S., Son, S., Yoon, S.e.: Advpaint: Protecting images from inpainting manipulation via adversarial attention disruption. In: ICLR (2025)

  26. [26]

    Biometrika30(1-2), 81–93 (1938)

    Kendall, M.G.: A new measure of rank correlation. Biometrika30(1-2), 81–93 (1938)

  27. [27]

    The Annals of Mathematical Statistics10(3), 275–287 (1939).https://doi.org/10.1214/aoms/ 1177732186

    Kendall, M.G., Babington Smith, B.: The problem of m rankings. The Annals of Mathematical Statistics10(3), 275–287 (1939).https://doi.org/10.1214/aoms/ 1177732186

  28. [28]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dock- horn, T., English, J., English, Z., Esser, P., et al.: Flux. 1 kontext: Flow match- ing for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)

  29. [29]

    Universal Image Immunization against Diffusion-based Image Editing via Semantic Injection

    Lee, C., Shin, S., Choi, D., Jeon, H.g., Son, J.: Universal image immuniza- tion against diffusion-based image editing via semantic injection. arXiv preprint arXiv:2602.14679 (2026)

  30. [30]

    arXiv preprint arXiv:2402.01239 (2024)

    Li, G., Yang, S., Zhang, J., Zhang, T.: Prime: Protect your videos from malicious editing. arXiv preprint arXiv:2402.01239 (2024)

  31. [31]

    In: ICML (2023)

    Liang, C., Wu, X., Hua, Y., Zhang, J., Xue, Y., Song, T., Xue, Z., Ma, R., Guan, H.: Adversarial example does good: preventing painting imitation from diffusion models via adversarial examples. In: ICML (2023)

  32. [32]

    In: EMNLP (2024)

    Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: EMNLP (2024)

  33. [33]

    In: EMNLP (2023)

    Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-eval: Nlg evaluation using gpt-4 with better human alignment. In: EMNLP (2023)

  34. [34]

    In: CVPR (2024)

    Lo, L., Yeo, C.Y., Shuai, H.H., Cheng, W.H.: Distraction is all you need: Memory- efficient image immunization against diffusion-based image editing. In: CVPR (2024)

  35. [35]

    In: ACL (2024)

    Maaz, M., Rasheed, H., Khan, S., Khan, F.: Video-chatgpt: Towards detailed video understanding via large vision and language models. In: ACL (2024)

  36. [36]

    ACM computing surveys (CSUR)54(1), 1–41 (2021)

    Mirsky, Y., Lee, W.: The creation and detection of deepfakes: A survey. ACM computing surveys (CSUR)54(1), 1–41 (2021)

  37. [37]

    In: CVPR (2017) 18 Songet al

    Moosavi-Dezfooli, S.M., Fawzi, A., Fawzi, O., Frossard, P.: Universal adversarial perturbations. In: CVPR (2017) 18 Songet al

  38. [38]

    OpenAI: Using gpt-5 (gpt-5 model family) — openai api documentation.https: //developers.openai.com/api/docs/guides/latest-model/

  39. [39]

    Pro- ceedings of the Royal Society of London58, 240–242 (1895)

    Pearson, K.: Note on regression and inheritance in the case of two parents. Pro- ceedings of the Royal Society of London58, 240–242 (1895)

  40. [40]

    In: ICCV (2023)

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV (2023)

  41. [41]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

  42. [42]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text- conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)

  43. [43]

    In: CVPR (2022)

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

  44. [44]

    In: MICCAI (2015)

    Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed- ical image segmentation. In: MICCAI (2015)

  45. [45]

    NeurIPS (2022)

    Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text- to-image diffusion models with deep language understanding. NeurIPS (2022)

  46. [46]

    In: ICML (2023)

    Salman, H., Khaddaj, A., Leclerc, G., Ilyas, A., Madry, A.: Raising the cost of malicious ai-powered image editing. In: ICML (2023)

  47. [47]

    In: USENIX Secu- rity (2023)

    Shan, S., Cryan, J., Wenger, E., Zheng, H., Hanocka, R., Zhao, B.Y.: Glaze: Pro- tecting artists from style mimicry by{Text-to-Image}models. In: USENIX Secu- rity (2023)

  48. [48]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text- video data. arXiv preprint arXiv:2209.14792 (2022)

  49. [49]

    Denoising Diffusion Implicit Models

    Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

  50. [50]

    The American Journal of Psychology15(1), 72–101 (1904).https://doi.org/10

    Spearman, C.: The proof and measurement of association between two things. The American Journal of Psychology15(1), 72–101 (1904).https://doi.org/10. 2307/1412159

  51. [51]

    MAGI-1: Autoregressive Video Generation at Scale

    Teng, H., Jia, H., Sun, L., Li, L., Li, M., Tang, M., Han, S., Zhang, T., Zhang, W., Luo, W., et al.: Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211 (2025)

  52. [52]

    Information fusion64, 131–148 (2020)

    Tolosana, R., Vera-Rodriguez, R., Fierrez, J., Morales, A., Ortega-Garcia, J.: Deep- fakes and beyond: A survey of face manipulation and fake detection. Information fusion64, 131–148 (2020)

  53. [53]

    NeurIPS (2017)

    Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,Kaiser, Ł., Polosukhin, I.: Attention is all you need. NeurIPS (2017)

  54. [54]

    Phenaki: Variable Length Video Generation From Open Domain Textual Description

    Villegas, R., Babaeizadeh, M., Kindermans, P.J., Moraldo, H., Zhang, H., Saffar, M.T., Castro, S., Kunze, J., Erhan, D.: Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399 (2022)

  55. [55]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,...

  56. [56]

    NeurIPS (2023)

    Wang, X., Yuan, H., Zhang, S., Chen, D., Wang, J., Zhang, Y., Shen, Y., Zhao, D., Zhou, J.: Videocomposer: Compositional video synthesis with motion control- lability. NeurIPS (2023)

  57. [57]

    IEEE transactions on image processing 13(4), 600–612 (2004)

    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)

  58. [58]

    arXiv preprint arXiv:2310.11986 (2023)

    Weidinger, L., Rauh, M., Marchal, N., Manzini, A., Hendricks, L.A., Mateos- Garcia, J., Bergman, S., Kay, J., Griffin, C., Bariach, B., et al.: Sociotechnical safety evaluation of generative ai systems. arXiv preprint arXiv:2310.11986 (2023)

  59. [59]

    In: ICCV (2023)

    Wu, J.Z., Ge, Y., Wang, X., Lei, S.W., Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., Shou, M.Z.: Tune-a-video: One-shot tuning of image diffusion models for text- to-video generation. In: ICCV (2023)

  60. [60]

    In: ECCV (2024)

    Xing,J.,Xia,M.,Zhang,Y.,Chen,H.,Yu,W.,Liu,H.,Liu,G.,Wang,X.,Shan,Y., Wong, T.T.: Dynamicrafter: Animating open-domain images with video diffusion priors. In: ECCV (2024)

  61. [61]

    In: ICLR (2025)

    Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. In: ICLR (2025)

  62. [62]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)

  63. [63]

    In: EMNLP (2023)

    Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual lan- guage model for video understanding. In: EMNLP (2023)

  64. [64]

    In: CVPR (2018)

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

  65. [65]

    I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

    Zhang, S., Wang, J., Zhang, Y., Zhao, K., Yuan, H., Qin, Z., Wang, X., Zhao, D., Zhou, J.: I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145 (2023)

  66. [66]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Zheng, D., Huang, Z., Liu, H., Zou, K., He, Y., Zhang, F., Gu, L., Zhang, Y., He, J., Zheng, W.S., et al.: Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755 (2025)

  67. [67]

    NeurIPS (2023)

    Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al.: Judging llm-as-a-judge with mt-bench and chatbot arena. NeurIPS (2023)

  68. [68]

    Open-Sora: Democratizing Efficient Video Production for All

    Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., You, Y.: Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404 (2024)

  69. [69]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) 20 Songet al. 8 Efficiency Analysis Our implementation is designed to minimize computational overhead during pro- tection ...

  70. [70]

    First, provide up to three concrete observations with timestamps or frame indices

  71. [71]

    Then, and only then, assign a discrete score from 1 to 5 based only on those observations

  72. [72]

    Assign the lowest score supported by the observed evidence

  73. [73]

    If no clear failure is observed, assign 5

  74. [74]

    uncertain

    If evidence is insufficient or ambiguous, set"uncertain": trueand use score 3

  75. [75]

    Do not speculate beyond what is visible

    Output only valid JSON. Do not speculate beyond what is visible. Input Context Provided to the Model: –Video ID: $VIDEO_ID –Prompt (context only): $PROMPT_STR –Assume FIRST frame is the input image. –A sequence of sampled frames, each preceded by a textual marker such asFrame $FRAME_INDEX (t=$TIMESTAMP s) Dimension-Specific Fields Injected into the Shared...

  76. [76]

    Provide up to three concrete observations with timestamps or frame indices

  77. [77]

    Based only on those observations, assign the score

  78. [78]

    uncertain

    If uncertain due to insufficient evidence, set"uncertain": trueand use score 3. Required Output Schema: { "observations": [ {"time": "e.g., 2.0s or frame#25", "description": "what is visibly observed"} ], "score": 1, "uncertain": false } Fig. 9:Shared prompt template used for Video-LLM evaluator. Subject Preservation Definition:Identity-level consistency:...