
arXiv: 2411.15115 · v3 · submitted 2024-11-22 · 💻 cs.CV · cs.AI · cs.CL

Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement

Pith reviewed 2026-05-23 08:23 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords text-to-video generation · video refinement · misalignment detection · localized refinement · training-free method · model-agnostic · diffusion models · prompt alignment

The pith

VideoRepair detects fine-grained text-video misalignments and performs targeted local refinements while preserving correct regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VideoRepair as a training-free framework that identifies where generated videos fail to match complex text prompts and corrects only those parts. This approach matters because full regeneration often discards accurate content and current T2V models frequently misalign on prompts with multiple objects or relations. The method relies on an MLLM to spot issues through auto-generated questions, then plans refinements that keep faithful areas intact before selectively regenerating the rest. If the central claim holds, existing diffusion-based video generators can produce better-aligned outputs across different base models without additional training. The work shows gains on two standard benchmarks using four recent backbones.

Core claim

The authors claim that a three-stage process—MLLM-driven misalignment detection with automatically generated questions, refinement planning that segments and preserves correct entities across frames, and joint optimization for localized regeneration—enables self-correction of text-to-video outputs. This process is model-agnostic and training-free, and experiments on EvalCrafter and T2V-CompBench demonstrate substantial gains in alignment metrics over recent baselines.
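
The control flow implied by this claim, combined with the iterative use described in the Figure 5 caption (each output feeds the next round), might be sketched as follows. This is a minimal illustration, not the authors' implementation: `iterative_refine`, the three callables, and `max_iters` are hypothetical names, and the toy stand-ins at the bottom replace the MLLM, the planner, and the diffusion model.

```python
from typing import Callable, List, Tuple, TypeVar

V = TypeVar("V")  # stands in for whatever object represents a video


def iterative_refine(
    prompt: str,
    video: V,
    detect: Callable[[str, V], List[str]],  # hypothetical: returns the failed questions
    plan: Callable[[str, List[str]], str],  # hypothetical: builds a refinement prompt
    refine: Callable[[V, str], V],          # hypothetical: localized regeneration
    max_iters: int = 3,                     # assumed budget, not taken from the paper
) -> Tuple[V, List[V]]:
    """Repeat detect -> plan -> refine until nothing is flagged or the budget runs out."""
    history = [video]
    for _ in range(max_iters):
        failures = detect(prompt, video)
        if not failures:  # nothing misaligned: stop early
            break
        video = refine(video, plan(prompt, failures))
        history.append(video)
    return video, history


# Toy stand-ins: the "video" is just the number of dogs it shows.
def toy_detect(prompt: str, dogs: int) -> List[str]:
    return [] if dogs >= 3 else ["Are there three dogs?"]


def toy_plan(prompt: str, failures: List[str]) -> str:
    return "add one dog"


def toy_refine(dogs: int, refine_prompt: str) -> int:
    return dogs + 1


final, steps = iterative_refine("three dogs", 1, toy_detect, toy_plan, toy_refine)
print(final, steps)  # prints: 3 [1, 2, 3]
```

Passing the stages in as callables mirrors the model-agnostic framing: the loop never needs to know which backbone or MLLM sits behind them.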

What carries the argument

Region-preserving refinement strategy with misalignment detection via MLLM, refinement planning, and localized refinement.

If this is right

  • The method improves alignment metrics across diverse prompts on EvalCrafter and T2V-CompBench.
  • It works without retraining when applied to four different recent T2V diffusion backbones.
  • It reduces unnecessary changes by keeping correctly generated regions intact during refinement.
  • Ablations confirm the framework remains efficient and produces interpretable correction steps.
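
To make the region-preservation idea in the third bullet concrete, here is a toy pixel-space composite. It is only an illustration under assumptions: the paper describes joint optimization of preserved and newly generated areas, which this hard copy-paste does not implement. The names `composite_frame` and `composite_video`, and the convention that 1 marks a preserved pixel, are hypothetical.

```python
from typing import List

Frame = List[List[int]]  # toy grayscale frame: rows of pixel values


def composite_frame(original: Frame, regenerated: Frame, keep_mask: Frame) -> Frame:
    """Take the original pixel where keep_mask is truthy, the regenerated one elsewhere."""
    return [
        [o if k else n for o, n, k in zip(o_row, n_row, k_row)]
        for o_row, n_row, k_row in zip(original, regenerated, keep_mask)
    ]


def composite_video(
    original: List[Frame], regenerated: List[Frame], keep_masks: List[Frame]
) -> List[Frame]:
    """Apply the per-frame composite with one mask per frame."""
    return [
        composite_frame(o, n, k)
        for o, n, k in zip(original, regenerated, keep_masks)
    ]


original = [[[1, 1], [1, 1]]]      # one 2x2 frame
regenerated = [[[9, 9], [9, 9]]]
keep_masks = [[[1, 0], [0, 1]]]    # assumed convention: 1 = preserve
print(composite_video(original, regenerated, keep_masks))  # prints: [[[1, 9], [9, 1]]]
```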

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same detection-plus-local-fix pattern could be tested on text-to-image models that also struggle with complex attribute binding.
  • If the MLLM detector generalizes, it might serve as an automatic quality filter before any refinement step.
  • The preservation of correct regions suggests a route to lower inference cost compared with full video regeneration.

Load-bearing premise

The MLLM-based evaluation with automatically generated questions reliably identifies which regions of the video are misaligned with the text prompt.
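
One way to see how much rests on this premise is to look at what the downstream stages consume. The sketch below takes two question-answer records copied from the Figure 11 caption and splits them into entities to keep and entities to fix. The helper names `split_by_answer` and `count_gap` and the 0.5 threshold are assumptions made for illustration, not details from the paper.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]

# Two records copied from the Figure 11 caption.
records: List[Record] = [
    {"Q": "Are there four children?", "A": 1.0,
     "reasoning": "There are four visible children in the image.",
     "obj_in_prompt": 4, "obj_in_img": 4},
    {"Q": "Are there three dogs?", "A": 0.0,
     "reasoning": "There is only one dog visible in the image.",
     "obj_in_prompt": 3, "obj_in_img": 1},
]


def split_by_answer(
    items: List[Record], threshold: float = 0.5  # assumed cut-off on the answer score
) -> Tuple[List[Record], List[Record]]:
    """Return (passed, failed) records according to the MLLM answer score."""
    passed: List[Record] = []
    failed: List[Record] = []
    for item in items:
        (passed if float(item["A"]) >= threshold else failed).append(item)
    return passed, failed


def count_gap(item: Record) -> int:
    """Objects requested by the prompt minus objects the MLLM reports seeing."""
    return int(item["obj_in_prompt"]) - int(item["obj_in_img"])


passed, failed = split_by_answer(records)
for item in failed:
    print(item["Q"], "gap:", count_gap(item))  # prints: Are there three dogs? gap: 2
```

Any wrong answer at this step propagates directly: a false pass freezes a faulty region, and a false fail regenerates a correct one.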

What would settle it

Running the detection stage on a set of videos with known, human-verified misalignments and finding that the MLLM fails to flag the actual mismatched regions at rates above random selection.
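
Such a test could be scored roughly as follows. This is a sketch under assumptions: item identifiers, the set-based representation, and the function names `precision_recall_f1` and `chance_precision` are hypothetical, and the chance level shown is simply the share of truly misaligned items, which is the expected precision of flagging items uniformly at random.

```python
from typing import Set, Tuple


def precision_recall_f1(flagged: Set[int], truth: Set[int]) -> Tuple[float, float, float]:
    """Score detector flags against human-verified misalignments (both as item ids)."""
    true_pos = len(flagged & truth)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def chance_precision(truth: Set[int], n_items: int) -> float:
    """Expected precision when items are flagged uniformly at random."""
    return len(truth) / n_items if n_items else 0.0


# Made-up example: 10 annotated items, 4 truly misaligned, detector flags 3.
human_labels = {2, 3, 4, 5}
detector_flags = {1, 2, 3}
p, r, f1 = precision_recall_f1(detector_flags, human_labels)
print(round(p, 3), round(r, 3), round(f1, 3), chance_precision(human_labels, 10))
```

A detector whose precision does not clearly exceed `chance_precision` on annotated data would undercut the premise above.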

Figures

Figures reproduced from arXiv: 2411.15115 by Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal.

Figure 1
Figure 1: VIDEOREPAIR is a model-agnostic, training-free, automatic refinement framework for improving alignments in text-to-video generation. Given an initial video from a text-to-video generation model, VIDEOREPAIR refines video in two stages: (1) video refinement planning and (2) localized refinement. The black-white mask in the bottom left of each example indicates the localized refinement plan (black: regions t…
Figure 2
Figure 2: Comparison of different refinement methods for alignment. (a) Prompt optimization (e.g., OPT2I [30]) by LLM-based rewriting without visual/fine-grained feedback, making the search expensive (e.g., 30 iterations). (b) Recent work on localized feedback (e.g., SLD [55]) provides visual guidance but relies on an external layout-guided generation module, often leading to unnatural refinements. (c) VIDEOREPAI…
Figure 3
Figure 3: Illustration of VIDEOREPAIR. VIDEOREPAIR refines the generated video in two stages: (1) video refinement planning (Sec. 3.1), (2) localized refinement (Sec. 3.2). Given the prompt p, we first generate a fine-grained evaluation question set and ask the MLLM to provide answers. Next, we identify accurately generated objects O∗ and plan the refinement p^r of other regions using MLLM/LLM. Based on O∗, we de…
Figure 4
Figure 4: Videos generated with T2V-turbo and refinement frameworks (OPT2I / SLD / VIDEOREPAIR) on T2V-turbo. VIDEOREPAIR successfully addresses object and attribute misalignment issues (e.g., numeracy, spatial relationship, attribute blending) compared to T2V-turbo and other refinement methods. More visualization examples with T2V-turbo and VideoCrafter2 are provided in the appendix. A family of four set up a tent …
Figure 5
Figure 5: The iterative refinement of VIDEOREPAIR. Videos in each column represent the outputs of successive refinement iterations, where the output from the previous step serves as the input for the current step. The text at the bottom of each video row indicates the corresponding text prompt. More visualization examples are provided in the appendix. … while maintaining the integrity of multi-object generation. In ad…
Figure 6
Figure 6: Refining videos when the key object disappears. VIDEOREPAIR successfully preserves disappearing objects (car) while incorporating previously missed objects (house). … (i.e., K = 1), compared to strong baselines (LLM paraphrasing: 44.7, SLD: 44.5, OPT2I: 45.7 in average), highlighting the effectiveness of VIDEOREPAIR refinement process. Furthermore, we enhance text-video alignment by incorporating video ra…
Figure 7
Figure 7: Comparison of DSG and DSGObj. Compared to DSGObj (ours), DSG does not penalize the video even if more than one object (e.g., 1 bear in this case) is generated when the target object count = 1 in the text prompt. A.2. Visual Question Answering: To evaluate the generated videos, we utilize GPT-4o to answer both count-related (Q^o_c) and attribute-related (Q^o_a) questions, as illustrated in …
Figure 8
Figure 8: Impact of the number of video candidates. We vary the number of video candidates K as 1, 5, 10, and 20 for ranking.
Figure 10
Figure 10: Single-object mask vs. Multi-object mask. …serves the O∗ areas while refining the remaining regions using p^r. For instance, in …
Figure 11
Figure 11: Output from each step of VIDEOREPAIR. We illustrate whole outputs from each step of VIDEOREPAIR. {'Q': 'Are there four children?', 'A': 1.0, 'reasoning': 'There are four visible children in the image.', 'obj_in_prompt': 4, 'obj_in_img': 4} {'Q': 'Are there three dogs?', 'A': 0.0, 'reasoning': 'There is only one dog visible in the image.', 'obj_in_prompt': 3, 'obj_in_img': 1} {'Q': 'Is there a picnic?', 'A…
Figure 12
Figure 12: Output from each step of VIDEOREPAIR. We illustrate whole outputs from each step of VIDEOREPAIR.
Figure 13
Figure 13: Qualitative examples from T2V-turbo. Panel labels: T2V-turbo, OPT2I (Iter=10), SLD (Iter=1), VideoRepair (Iter=1), Vico; frames 1, 8, and 16. Prompts shown: "1 bear and 2 people making pizza"; "Teddy bear and 3 real bear".
Figure 14
Figure 14: Qualitative examples from T2V-turbo.
Figure 15
Figure 15: Qualitative examples from T2V-turbo. Panel labels: T2V-turbo, OPT2I (Iter=10), SLD (Iter=1), VideoRepair (Iter=1), Vico; frames 1, 8, and 16. Prompts shown: "five aliens in a forest"; "Five colorful parrots perch on a branch, squawking loudly at each other".
Figure 16
Figure 16: Qualitative examples from T2V-turbo.
Figure 17
Figure 17: Qualitative examples from VideoCrafter2. Panel labels: VideoCrafter2, OPT2I (Iter=10), SLD (Iter=1), VideoRepair (Iter=1), Vico; frames 1, 8, and 16. Prompts shown: "A dog sitting under a umbrella on a sunny beach"; "With the style of pointilism, A green apple and a black backpack."
Figure 18
Figure 18: Qualitative examples from VideoCrafter2.
Figure 19
Figure 19: Qualitative examples from VideoCrafter2. A team of two marine biologists study a colony of penguins, monitoring their breeding habits. Iteration 1~3 A group of six dancers perform a ballet on stage, their movements synchronized and graceful. Iteration 1~3 A bright yellow umbrella with a wooden handle. It's compact and easy to carry. A mother and her child feed ducks at a pond. Five cows graze lazily in a …
Figure 20
Figure 20: Videos generated using iterative refinement with VIDEOREPAIR. We depict iterative refinement results generated from T2V-Turbo. Overall, VIDEOREPAIR progressively enhances text-video alignment with each refinement step.
Figure 21
Figure 21: Prompts to perform visual question answering in video evaluation steps. Top: The prompt for Q^o_c (count-related question), Bottom: prompt for Q^o_a (attribute-related question). cur question means each DSGObj question and key objects means entity word in each question. Given the image which compose of multiple concatenated frames from a video and the list of question-answer pairs for each object, represe…
Figure 22
Figure 22: Prompt to choose which object(s) to preserve. We ask GPT4o to select objects to preserve in the scene.
Figure 23
Figure 23: Prompt to plan how to refine the other regions. We use five in-context examples to create the refinement prompt from the question related to other objects. Generate 1 paraphrase of the following image description while keeping the semantic meaning: "{init_prompt}". Provide your response as a single phrase without any explanation. Format it as: <PROMPT> ... </PROMPT>. (e.g., <PROMPT>Two dogs and a whale em…
Figure 24
Figure 24: Prompt for LLM paraphrasing. Following OPT2I [30], we ask GPT4 to generate diverse paraphrases of each prompt for LLM paraphrasing baseline experiments.
Original abstract

Recent text-to-video (T2V) diffusion models have made remarkable progress in generating high-quality videos. However, they often struggle to align with complex text prompts, particularly when multiple objects, attributes, or spatial relations are specified. We introduce VideoRepair, the first self-correcting, training-free, and model-agnostic video refinement framework that automatically detects fine-grained text-video misalignments and performs targeted, localized corrections. Our key insight is that even misaligned videos usually contain correctly generated regions that should be preserved rather than regenerated. Building on this observation, VideoRepair proposes a novel region-preserving refinement strategy with three stages: (i) misalignment detection, where MLLM-based evaluation with automatically generated evaluation questions identifies misaligned regions; (ii) refinement planning, which preserves correctly generated entities, segments their regions across frames, and constructs targeted prompts for misaligned areas; and (iii) localized refinement, which selectively regenerates problematic regions while preserving faithful content through joint optimization of preserved and newly generated areas. On two benchmarks, EvalCrafter and T2V-CompBench with four recent T2V backbones, VideoRepair achieves substantial improvements over recent baselines across diverse alignment metrics. Comprehensive ablations further demonstrate the efficiency, robustness, and interpretability of our framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VideoRepair, a training-free and model-agnostic framework for refining text-to-video (T2V) outputs. It consists of three stages: (i) MLLM-based misalignment detection via automatically generated questions to identify misaligned regions, (ii) refinement planning that preserves correct entities and segments regions, and (iii) localized refinement that regenerates only problematic areas through joint optimization. The central claim is that this yields substantial improvements over baselines on EvalCrafter and T2V-CompBench across four recent T2V backbones, supported by ablations on efficiency and robustness.

Significance. If the misalignment detection proves reliable and the gains are attributable to targeted preservation rather than generic resampling, the work could meaningfully advance training-free post-processing for T2V alignment, particularly for complex multi-object prompts. The model-agnostic design and emphasis on region preservation are practical strengths.

major comments (2)
  1. [misalignment detection] Misalignment detection section: No quantitative validation metrics (precision, recall, IoU, or F1) are reported for the MLLM-based detector against human-annotated ground truth on fine-grained errors (objects, attributes, relations). This is load-bearing for the central claim, as improvements on the benchmarks cannot be confidently attributed to precise localized correction without evidence that detection errors are rare.
  2. [experiments / ablations] Experiments section (ablation studies): While comprehensive ablations are mentioned, the contribution of the detection stage versus the planning and refinement stages is not isolated with controlled variants (e.g., random region selection or full regeneration baselines). This leaves open whether the reported gains on EvalCrafter and T2V-CompBench stem specifically from the self-correcting pipeline.
minor comments (2)
  1. [abstract] The abstract states 'substantial improvements' but provides no numerical values, error bars, or baseline details; these should be summarized with key metrics in the abstract for clarity.
  2. [localized refinement] Notation for region segmentation across frames and the joint optimization objective in the localized refinement stage could be formalized with equations to improve reproducibility.
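
For the IoU figure requested in the first major comment, a per-frame mask comparison could look like the sketch below. The binary-mask representation, the name `mask_iou`, and the choice to score two empty masks as 1.0 are assumptions; the paper's own mask format is not specified in this review.

```python
from typing import List

Mask = List[List[int]]  # binary mask for one frame, 1 = flagged as misaligned (assumed)


def mask_iou(predicted: Mask, annotated: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    intersection = 0
    union = 0
    for p_row, a_row in zip(predicted, annotated):
        for p, a in zip(p_row, a_row):
            if p and a:
                intersection += 1
            if p or a:
                union += 1
    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


predicted = [[1, 1], [0, 0]]
annotated = [[1, 0], [1, 0]]
print(mask_iou(predicted, annotated))  # one shared pixel out of three covered
```

Averaging this score over frames and videos would give a single localization number to report alongside precision, recall, and F1.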

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the validation of our claims.

Point-by-point responses
  1. Referee: Misalignment detection section: No quantitative validation metrics (precision, recall, IoU, or F1) are reported for the MLLM-based detector against human-annotated ground truth on fine-grained errors (objects, attributes, relations). This is load-bearing for the central claim, as improvements on the benchmarks cannot be confidently attributed to precise localized correction without evidence that detection errors are rare.

    Authors: We acknowledge that the current manuscript lacks direct quantitative metrics (e.g., precision, recall, F1) for the MLLM-based misalignment detector evaluated against human-annotated ground truth on fine-grained errors. The end-to-end gains on EvalCrafter and T2V-CompBench with four backbones, combined with robustness ablations, provide indirect support, but we agree this does not fully isolate detection reliability. We will add a dedicated human evaluation subsection reporting precision, recall, IoU, and F1 on a sampled set of videos with fine-grained annotations. revision: yes

  2. Referee: Experiments section (ablation studies): While comprehensive ablations are mentioned, the contribution of the detection stage versus the planning and refinement stages is not isolated with controlled variants (e.g., random region selection or full regeneration baselines). This leaves open whether the reported gains on EvalCrafter and T2V-CompBench stem specifically from the self-correcting pipeline.

    Authors: We agree that controlled variants isolating the detection stage (such as random region selection or full regeneration) are needed to attribute gains specifically to the pipeline. Our existing ablations cover component removals and efficiency, but do not include these exact baselines. We will add these experiments in the revised version, comparing VideoRepair against random-region and full-regeneration controls on the same benchmarks and backbones. revision: yes
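
A random-region control of the kind discussed here needs masks that match the detected mask in area, so that any difference in alignment score reflects where the regeneration happens and not how much is regenerated. The sketch below is one hypothetical way to build such a control; `random_mask_like` is an illustrative name, and scattering individual pixels is a simplification, since a fairer control would probably sample contiguous regions.

```python
import random
from typing import List

Mask = List[List[int]]  # binary mask for one frame


def random_mask_like(mask: Mask, seed: int = 0) -> Mask:
    """Return a mask of the same shape and the same number of 1s, placed at random."""
    height = len(mask)
    width = len(mask[0]) if mask else 0
    area = sum(1 for row in mask for value in row if value)
    rng = random.Random(seed)  # seeded for reproducible controls
    chosen = set(rng.sample(range(height * width), area))
    return [
        [1 if (r * width + c) in chosen else 0 for c in range(width)]
        for r in range(height)
    ]


detected = [[1, 1, 0], [0, 0, 0], [0, 1, 0]]
control = random_mask_like(detected, seed=42)
print(sum(map(sum, control)))  # same area as the detected mask: 3
```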

Circularity Check

0 steps flagged

No circularity: empirical framework with no derivation chain

full rationale

The paper presents a training-free, model-agnostic refinement pipeline evaluated empirically on EvalCrafter and T2V-CompBench. No equations, fitted parameters, predictions derived from inputs, or self-citations are described as load-bearing for the central claims. The three stages (misalignment detection via MLLM, refinement planning, localized refinement) are introduced as novel components without reducing to prior fitted values or self-referential definitions. The reported improvements are benchmark results, not outputs forced by construction from the method's own inputs. This is the expected non-finding for an applied empirical method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no free parameters, domain axioms beyond standard machine-learning practice, or new invented entities.

pith-pipeline@v0.9.0 · 5771 in / 1216 out tokens · 58637 ms · 2026-05-23T08:23:00.410350+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiB...

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 2

  2. [2]

    The crystal ball hypothesis in diffusion models: Anticipating object positions from initial noise

    Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Boqing Gong, Cho-Jui Hsieh, and Minhao Cheng. The crystal ball hypothesis in diffusion models: Anticipating object positions from initial noise. arXiv preprint arXiv:2406.01970, 2024. 2, 5

  3. [3]

    Multidiffusion: Fusing diffusion paths for controlled image generation

    Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. In Proceedings of the International Conference on Machine Learning (ICML), 2023. 5

  4. [4]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 2

  5. [5]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024. 2, 7

  6. [6]

    Segment and track anything

    Yangming Cheng, Liulei Li, Yuanyou Xu, Xiaodi Li, Zongxin Yang, Wenguan Wang, and Yi Yang. Segment and track anything. arXiv preprint arXiv:2305.06558, 2023. 6

  7. [7]

    Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-image generation

    Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, and Su Wang. Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-image generation. In Proceedings of the International Conference on Learning Representations (ICLR), 2024. 2, 3, 7, 12

  8. [8]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146, 2024. 5

  9. [9]

    Structure and content-guided video synthesis with diffusion models

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the International Conference on Computer Vision (ICCV), 2023. 2

  10. [10]

    CLIPScore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. 8

  11. [11]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 2

  12. [12]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2022. 2

  13. [13]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868,

  14. [14]

    Zeroscope, 2023

    huggingface. Zeroscope, 2023. 6, 7

  15. [15]

    Text2video-zero: Text-to-image diffusion models are zero-shot video generators

    Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the International Conference on Computer Vision (ICCV), 2023. 2

  16. [16]

    Semantic-sam: Segment and recognize anything at any granularity

    Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-sam: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023. 5

  17. [17]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR,

  18. [18]

    Selma: Learning and merging skill-specific text-to-image experts with auto-generated data

    Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, and Mohit Bansal. Selma: Learning and merging skill-specific text-to-image experts with auto-generated data. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 3

  19. [19]

    T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback

    Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, and William Yang Wang. T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 2, 7, 13

  20. [20]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 2, 3, 5, 12

  21. [21]

    Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models

    Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023. 7

  22. [22]

    Llm-grounded video diffusion models

    Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, and Boyi Li. Llm-grounded video diffusion models. In Proceedings of the International Conference on Learning Representations (ICLR), 2024. 3

  23. [23]

    Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning

    Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning. arXiv preprint arXiv:2309.15091,

  24. [24]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 7

  25. [25]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 7

  26. [26]

    Evalcrafter: Benchmarking and evaluating large video generation models

    Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 5, 6, 13

  27. [27]

    Videofusion: Decomposed diffusion models for high-quality video generation

    Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. arXiv preprint arXiv:2303.08320,

  28. [28]

    Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning

    Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, and Shifeng Chen. Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 3

  29. [29]

    Latte: Latent Diffusion Transformer for Video Generation

    Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024. 6, 7

  30. [30]

    Improving text-to-image consistency via automatic prompt optimization

    Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, and Michal Drozdzal. Improving text-to-image consistency via automatic prompt optimization. arXiv preprint arXiv:2403.17804, 2024. 2, 3, 5, 6, 7, 12, 13, 14, 21

  31. [31]

    Guided image synthesis via initial image editing in diffusion model

    Jiafeng Mao, Xueting Wang, and Kiyoharu Aizawa. Guided image synthesis via initial image editing in diffusion model. In Proceedings of the 31st ACM International Conference on Multimedia, pages 5321–5329, 2023. 2, 5

  32. [32]

    Hello gpt-4o, 2024

    OpenAI. Hello gpt-4o, 2024. 2, 7

  33. [33]

    GPT-4 technical report, 2024

    OpenAI. GPT-4 technical report, 2024. 7

  34. [35]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,

  35. [36]

    Open-sora-plan, 2023

    PKU-Yuan Lab and Tuzhan AI etc. Open-sora-plan, 2023. 6, 7

  36. [37]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. 6

  37. [38]

    Not all noises are created equally: Diffusion noise selection and optimization

    Zipeng Qi, Lichen Bai, Haoyi Xiong, et al. Not all noises are created equally: Diffusion noise selection and optimization. arXiv preprint arXiv:2407.14041, 2024. 2, 5

  38. [39]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 6, 13

  39. [40]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2

  40. [41]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016. 6, 13

  41. [42]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792,

  42. [43]

    Consistency Models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023. 2

  43. [44]

    Dreamsync: Aligning text-to-image generation with image understanding feedback

    Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, et al. Dreamsync: Aligning text-to-image generation with image understanding feedback. In Synthetic Data for Computer Vision Workshop @ CVPR 2024,

  44. [45]

    T2v-compbench: A comprehensive benchmark for compositional text-to-video generation

    Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. arXiv preprint arXiv:2407.14505, 2024. 2, 6, 13

  45. [46]

    Spatial-aware latent initialization for controllable image generation

    Wenqiang Sun, Teng Li, Zehong Lin, and Jun Zhang. Spatial-aware latent initialization for controllable image generation. arXiv preprint arXiv:2401.16157, 2024. 2, 5

  46. [47]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020. 6, 13

  47. [48]

    Videotetris: Towards compositional text-to-video generation

    Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, et al. Videotetris: Towards compositional text-to-video generation. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 6, 7

  48. [49]

    ModelScope Text-to-Video Technical Report

    Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023. 2, 6, 7

  49. [50]

    Videomae v2: Scaling video masked autoencoders with dual masking

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14549–14560, 2023. 6, 13

  50. [51]

    Videolcm: Video latent consistency model, 2023

    Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, and Nong Sang. Videolcm: Video latent consistency model, 2023. 2

  51. [52]

    A recipe for scaling up text-to-video generation with text-free videos

    Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, and Nong Sang. A recipe for scaling up text-to-video generation with text-free videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6572–6582, 2024. 2

  52. [53]

    Exploring video quality assessment on user generated contents from aesthetic and technical perspectives

    Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20144–20154, 2023. 6, 13

  53. [54]

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the International Conference on Computer Vision (ICCV), 2023. 2

  54. [55]

    Self-correcting llm-controlled diffusion models

    Tsung-Han Wu, Long Lian, Joseph E Gonzalez, Boyi Li, and Trevor Darrell. Self-correcting llm-controlled diffusion models. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 3, 4, 6, 7, 12, 13, 14

  55. [56]

    Compositional video generation as flow equalization

    Xingyi Yang and Xinchao Wang. Compositional video generation as flow equalization. arXiv preprint arXiv:2407.06182,

  56. [57]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 2, 7

  57. [58]

    Show-1: Marrying pixel and latent diffusion models for text-to-video generation

    David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. International Journal of Computer Vision, pages 1–15, 2024. 6, 7

  58. [59]

    ControlVideo: Training-free controllable text-to-video generation

    Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023. 2

    Appendix A. VideoRepair Implementation Details: A.1. Question Generation; A.2. Visual Question Answering; A....

  59. [60]

    For example, video ranking is not applied when K = 1, and only one refinement is produced using a single random seed noise ϵ′

  60. [61]

    As depicted in Fig

    For ranking metrics, we rely on DSG Obj across all ablation studies. As depicted in Fig. 8, higher K values (5, 10, and 20) consistently yield higher scores across all categories than K = 1. This trend is particularly prominent in the ‘count’ category, where increasing K leads to noticeable performance improvements, highlighting the importance of consi...

  61. [62]

    {cur_question}

    Given the question: "{cur_question}", provide a brief reasoning (up to two sentences) to determine the accurate answer

  62. [63]

    Yes" and 0.0 for

    Respond to the question using binary values: 1.0 for "Yes" and 0.0 for "No". If the answer is uncertain or unnatural due to image distortion or other issues, respond with 0.0 ("No")

  63. [64]

    {key_objects}

    Return the number of "{key_objects}" (as an integer) mentioned in the initial prompt "{cur_question}"

  64. [65]

    {key_objects}

    Return the number of "{key_objects}" (as an integer) in the provided image. Return the result as a dictionary in the following format (not in JSON format): {{"Q": "<question>", "A": <binary answer>, "reasoning": "<brief reasoning>", "obj_in_prompt": <number of key object mentioned in the initial prompt>, "obj_in_img": <number of key object in the image>}}...
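
    The prompt fragments above can be exercised with a minimal Python sketch. It assumes the templates are filled with Python's `str.format`, which would explain the doubled braces: they are escapes that render as literal braces. The names `ANSWER_FORMAT`, `build_prompt`, `parse_reply`, and `count_mismatch` are hypothetical helpers introduced for illustration, not the authors' code. The template text is copied only up to the point where the source is truncated.

```python
import ast

# Template text copied from the visible part of the prompt above.
# "{key_objects}" is a str.format placeholder; "{{" and "}}" are escapes
# that become literal "{" and "}" after formatting.
ANSWER_FORMAT = (
    'Return the number of "{key_objects}" (as an integer) in the provided image. '
    'Return the result as a dictionary in the following format (not in JSON format): '
    '{{"Q": "<question>", "A": <binary answer>, "reasoning": "<brief reasoning>", '
    '"obj_in_prompt": <number of key object mentioned in the initial prompt>, '
    '"obj_in_img": <number of key object in the image>}}'
)


def build_prompt(key_objects: str) -> str:
    """Fill the template for one key object (hypothetical helper)."""
    return ANSWER_FORMAT.format(key_objects=key_objects)


def parse_reply(reply: str) -> dict:
    """Extract the dictionary literal from a model reply (hypothetical helper).

    The prompt asks for a Python-style dictionary rather than JSON, so
    ast.literal_eval is used instead of json.loads. It only evaluates
    literals and does not execute arbitrary code.
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no dictionary found in reply")
    parsed = ast.literal_eval(reply[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("reply did not contain a dictionary")
    return parsed


def count_mismatch(parsed: dict) -> bool:
    """True when the object count in the image differs from the prompt."""
    return int(parsed["obj_in_prompt"]) != int(parsed["obj_in_img"])
```

    For instance, a reply whose dictionary has `"obj_in_prompt": 2` and `"obj_in_img": 1` would make `count_mismatch` return `True`. How the actual pipeline consumes these fields is not shown in the fragments above.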