Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement

Daeun Lee; Jaehong Yoon; Jaemin Cho; Mohit Bansal

arXiv: 2411.15115 · v3 · submitted 2024-11-22 · cs.CV · cs.AI · cs.CL

Pith reviewed 2026-05-23 08:23 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.CL

keywords text-to-video generation · video refinement · misalignment detection · localized refinement · training-free method · model-agnostic · diffusion models · prompt alignment

The pith

VideoRepair detects fine-grained text-video misalignments and performs targeted local refinements while preserving correct regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VideoRepair as a training-free framework that identifies where generated videos fail to match complex text prompts and corrects only those parts. This approach matters because full regeneration often discards accurate content and current T2V models frequently misalign on prompts with multiple objects or relations. The method relies on an MLLM to spot issues through auto-generated questions, then plans refinements that keep faithful areas intact before selectively regenerating the rest. If the central claim holds, existing diffusion-based video generators can produce better-aligned outputs across different base models without additional training. The work shows gains on two standard benchmarks using four recent backbones.

Core claim

The authors claim that a three-stage process—MLLM-driven misalignment detection with automatically generated questions, refinement planning that segments and preserves correct entities across frames, and joint optimization for localized regeneration—enables self-correction of text-to-video outputs. This process is model-agnostic and training-free, and experiments on EvalCrafter and T2V-CompBench demonstrate substantial gains in alignment metrics over recent baselines.
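
The control flow of the three stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `QA` record, the 0.5 threshold, and the `ask_mllm`, `segment`, and `regenerate` callables are hypothetical placeholders for the MLLM evaluator, the segmentation model, and the T2V backbone.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class QA:
    """One evaluation question and the MLLM's verdict (hypothetical schema)."""
    question: str
    entity: str
    score: float  # 1.0 = matches the prompt, 0.0 = misaligned


def plan_refinement(answers: List[QA], threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    """Split entities into those to preserve and those to regenerate."""
    preserve = [a.entity for a in answers if a.score >= threshold]
    refine = [a.entity for a in answers if a.score < threshold]
    return preserve, refine


def self_correct(
    prompt: str,
    video: Any,
    ask_mllm: Callable[[str, Any], List[QA]],
    segment: Callable[[Any, List[str]], Any],
    regenerate: Callable[[Any, Any, List[str]], Any],
) -> Any:
    """Run detection, planning, and localized refinement once."""
    answers = ask_mllm(prompt, video)              # stage 1: misalignment detection
    preserve, refine = plan_refinement(answers)    # stage 2: refinement planning
    if not refine:
        return video                               # nothing flagged; keep the video as is
    keep_mask = segment(video, preserve)           # regions to leave untouched
    return regenerate(video, keep_mask, refine)    # stage 3: localized refinement
```

Passing the stages in as callables keeps the sketch independent of any particular backbone, in the spirit of the model-agnostic claim.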

What carries the argument

A region-preserving refinement strategy with three stages: MLLM-based misalignment detection, refinement planning, and localized refinement.

If this is right

The method improves alignment metrics across diverse prompts on EvalCrafter and T2V-CompBench.
It works without retraining when applied to four different recent T2V diffusion backbones.
It reduces unnecessary changes by keeping correctly generated regions intact during refinement.
Ablations confirm the framework remains efficient and produces interpretable correction steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same detection-plus-local-fix pattern could be tested on text-to-image models that also struggle with complex attribute binding.
If the MLLM detector generalizes, it might serve as an automatic quality filter before any refinement step.
The preservation of correct regions suggests a route to lower inference cost compared with full video regeneration.

Load-bearing premise

The MLLM-based evaluation with automatically generated questions reliably identifies which regions of the video are misaligned with the text prompt.

What would settle it

Running the detection stage on a set of videos with known, human-verified misalignments and finding that the MLLM flags the actual mismatched regions no more often than random selection would.
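
A hedged sketch of that check: compare the detector's recall on human-verified mismatches with the recall expected from flagging the same number of regions uniformly at random. The region labels and function names below are illustrative assumptions, not taken from the paper.

```python
from typing import Set


def detection_recall(flagged: Set[str], truth: Set[str]) -> float:
    """Fraction of human-verified mismatched regions that the detector flagged."""
    return len(flagged & truth) / len(truth) if truth else 1.0


def random_recall(num_flagged: int, num_regions: int) -> float:
    """Expected recall when num_flagged of num_regions are picked uniformly at random."""
    return num_flagged / num_regions if num_regions else 0.0


# Hypothetical video with four candidate regions, one of which is truly misaligned.
regions = {"children", "dogs", "tent", "picnic"}
truth = {"dogs"}
flagged = {"dogs", "tent"}

observed = detection_recall(flagged, truth)
baseline = random_recall(len(flagged), len(regions))
print(f"observed={observed:.2f} baseline={baseline:.2f}")
```

A detector whose observed recall stays close to the random baseline across many videos would count against the load-bearing premise.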

Figures

Figures reproduced from arXiv: 2411.15115 by Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal.

**Figure 1.** VIDEOREPAIR is a model-agnostic, training-free, automatic refinement framework for improving alignments in text-to-video generation. Given an initial video from a text-to-video generation model, VIDEOREPAIR refines video in two stages: (1) video refinement planning and (2) localized refinement. The black-white mask in the bottom left of each example indicates the localized refinement plan (black: regions t…

**Figure 2.** Comparison of different refinement methods for alignment. (a) Prompt optimization (e.g., OPT2I [30]) by LLM-based rewriting without visual/fine-grained feedback, making the search expensive (e.g., 30 iterations). (b) Recent work on localized feedback (e.g., SLD [55]) provides visual guidance but relies on an external layout-guided generation module, often leading to unnatural refinements. (c) VIDEOREPAI…

**Figure 3.** Illustration of VIDEOREPAIR. VIDEOREPAIR refines the generated video in two stages: (1) video refinement planning (Sec. 3.1), (2) localized refinement (Sec. 3.2). Given the prompt p, we first generate a fine-grained evaluation question set and ask the MLLM to provide answers. Next, we identify accurately generated objects O* and plan the refinement p_r of other regions using MLLM/LLM. Based on O*, we de…

**Figure 4.** Videos generated with T2V-turbo and refinement frameworks (OPT2I / SLD / VIDEOREPAIR) on T2V-turbo. VIDEOREPAIR successfully addresses object and attribute misalignment issues (e.g., numeracy, spatial relationship, attribute blending) compared to T2V-turbo and other refinement methods. More visualization examples with T2V-turbo and VideoCrafter2 are provided in the appendix. A family of four set up a tent …

**Figure 5.** The iterative refinement of VIDEOREPAIR. Videos in each column represent the outputs of successive refinement iterations, where the output from the previous step serves as the input for the current step. The text at the bottom of each video row indicates the corresponding text prompt. More visualization examples are provided in the appendix.

**Figure 6.** Refining videos when the key object disappears. VIDEOREPAIR successfully preserves disappearing objects (car) while incorporating previously missed objects (house).

**Figure 7.** Comparison of DSG and DSGObj. Compared to DSGObj (ours), DSG does not penalize the video even if more than one object (e.g., 1 bear in this case) is generated when the target object count = 1 in the text prompt.

**Figure 8.** Impact of the number of video candidates. We vary the number of video candidates K as 1, 5, 10, and 20 for ranking.

**Figure 10.** Single-object mask vs. multi-object mask.

**Figure 11.** Output from each step of VIDEOREPAIR. We illustrate whole outputs from each step of VIDEOREPAIR. {'Q': 'Are there four children?', 'A': 1.0, 'reasoning': 'There are four visible children in the image.', 'obj_in_prompt': 4, 'obj_in_img': 4} {'Q': 'Are there three dogs?', 'A': 0.0, 'reasoning': 'There is only one dog visible in the image.', 'obj_in_prompt': 3, 'obj_in_img': 1} {'Q': 'Is there a picnic?', 'A…

**Figure 12.** Output from each step of VIDEOREPAIR. We illustrate whole outputs from each step of VIDEOREPAIR.

**Figure 13.** Qualitative examples from T2V-turbo. Prompts: "1 bear and 2 people making pizza"; "Teddy bear and 3 real bear". Panels: T2V-turbo, OPT2I (Iter=10), SLD (Iter=1), VideoRepair (Iter=1), Vico; frames 1, 8, and 16.

**Figure 14.** Qualitative examples from T2V-turbo.

**Figure 15.** Qualitative examples from T2V-turbo. Prompts: "five aliens in a forest"; "Five colorful parrots perch on a branch, squawking loudly at each other". Panels: T2V-turbo, OPT2I (Iter=10), SLD (Iter=1), VideoRepair (Iter=1), Vico; frames 1, 8, and 16.

**Figure 16.** Qualitative examples from T2V-turbo.

**Figure 17.** Qualitative examples from VideoCrafter2. Prompts: "A dog sitting under a umbrella on a sunny beach"; "With the style of pointilism, A green apple and a black backpack." Panels: VideoCrafter2, OPT2I (Iter=10), SLD (Iter=1), VideoRepair (Iter=1), Vico; frames 1, 8, and 16.

**Figure 18.** Qualitative examples from VideoCrafter2.

**Figure 19.** Qualitative examples from VideoCrafter2. A team of two marine biologists study a colony of penguins, monitoring their breeding habits. Iteration 1~3 A group of six dancers perform a ballet on stage, their movements synchronized and graceful. Iteration 1~3 A bright yellow umbrella with a wooden handle. It's compact and easy to carry. A mother and her child feed ducks at a pond. Five cows graze lazily in a …

**Figure 20.** Videos generated using iterative refinement with VIDEOREPAIR. We depict iterative refinement results generated from T2V-Turbo. Overall, VIDEOREPAIR progressively enhances text-video alignment with each refinement step.

**Figure 21.** Prompts to perform visual question answering in video evaluation steps. Top: the prompt for Q^o_c (count-related question). Bottom: the prompt for Q^o_a (attribute-related question). cur question means each DSGObj question and key objects means entity word in each question. Given the image which compose of multiple concatenated frames from a video and the list of question-answer pairs for each object, represe…

**Figure 22.** Prompt to choose which object(s) to preserve. We ask GPT-4o to select objects to preserve in the scene.

**Figure 23.** Prompt to plan how to refine the other regions. We use five in-context examples to create the refinement prompt from the question related to other objects. Generate 1 paraphrase of the following image description while keeping the semantic meaning: "{init_prompt}". Provide your response as a single phrase without any explanation. Format it as: <PROMPT> ... </PROMPT>. (e.g., <PROMPT>Two dogs and a whale em…

**Figure 24.** Prompt for LLM paraphrasing. Following OPT2I [30], we ask GPT-4 to generate diverse paraphrases of each prompt for LLM paraphrasing baseline experiments.

Original abstract

Recent text-to-video (T2V) diffusion models have made remarkable progress in generating high-quality videos. However, they often struggle to align with complex text prompts, particularly when multiple objects, attributes, or spatial relations are specified. We introduce VideoRepair, the first self-correcting, training-free, and model-agnostic video refinement framework that automatically detects fine-grained text-video misalignments and performs targeted, localized corrections. Our key insight is that even misaligned videos usually contain correctly generated regions that should be preserved rather than regenerated. Building on this observation, VideoRepair proposes a novel region-preserving refinement strategy with three stages: (i) misalignment detection, where MLLM-based evaluation with automatically generated evaluation questions identifies misaligned regions; (ii) refinement planning, which preserves correctly generated entities, segments their regions across frames, and constructs targeted prompts for misaligned areas; and (iii) localized refinement, which selectively regenerates problematic regions while preserving faithful content through joint optimization of preserved and newly generated areas. On two benchmarks, EvalCrafter and T2V-CompBench with four recent T2V backbones, VideoRepair achieves substantial improvements over recent baselines across diverse alignment metrics. Comprehensive ablations further demonstrate the efficiency, robustness, and interpretability of our framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VideoRepair offers a training-free three-stage pipeline to detect T2V misalignments with an MLLM and refine only the bad regions while preserving the rest, but the detection step lacks any reported validation against human labels.

The paper introduces VideoRepair as a self-correcting framework for text-to-video generation. It uses an MLLM to spot fine-grained misalignments via auto-generated questions, plans refinements that keep correct entities and their regions intact across frames, and then applies localized regeneration with joint optimization of preserved and new content. This region-preserving approach is the main new element and does not obviously collapse into prior post-processing or resampling methods cited in the abstract.

The work does a reasonable job targeting a known practical weakness in current T2V models on complex prompts involving multiple objects, attributes, and relations, and it tests the pipeline across four different backbones on EvalCrafter and T2V-CompBench. That model-agnostic and training-free framing is useful for people who cannot retrain large models.

The soft spot is the misalignment detection stage. The abstract and stress-test note give no quantitative check of how well the MLLM questions identify the actual misaligned regions against human ground truth, such as IoU or F1 scores. Without that, it is hard to tell whether reported gains come from precise, targeted fixes or from incidental extra generation. If detection errors are frequent, the later stages could either miss problems or alter good areas, which undercuts the central claim.

The paper is aimed at researchers and engineers working on generative video who need better prompt alignment without new training runs. A reader focused on post-hoc correction techniques would find the pipeline details worth examining if the full results include detection metrics and ablations. I would send it for peer review so the numbers and validation can be checked directly.

Referee Report

2 major / 2 minor

Summary. The paper introduces VideoRepair, a training-free and model-agnostic framework for refining text-to-video (T2V) outputs. It consists of three stages: (i) MLLM-based misalignment detection via automatically generated questions to identify misaligned regions, (ii) refinement planning that preserves correct entities and segments regions, and (iii) localized refinement that regenerates only problematic areas through joint optimization. The central claim is that this yields substantial improvements over baselines on EvalCrafter and T2V-CompBench across four recent T2V backbones, supported by ablations on efficiency and robustness.

Significance. If the misalignment detection proves reliable and the gains are attributable to targeted preservation rather than generic resampling, the work could meaningfully advance training-free post-processing for T2V alignment, particularly for complex multi-object prompts. The model-agnostic design and emphasis on region preservation are practical strengths.

major comments (2)

[misalignment detection] Misalignment detection section: No quantitative validation metrics (precision, recall, IoU, or F1) are reported for the MLLM-based detector against human-annotated ground truth on fine-grained errors (objects, attributes, relations). This is load-bearing for the central claim, as improvements on the benchmarks cannot be confidently attributed to precise localized correction without evidence that detection errors are rare.
[experiments / ablations] Experiments section (ablation studies): While comprehensive ablations are mentioned, the contribution of the detection stage versus the planning and refinement stages is not isolated with controlled variants (e.g., random region selection or full regeneration baselines). This leaves open whether the reported gains on EvalCrafter and T2V-CompBench stem specifically from the self-correcting pipeline.
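
The validation metrics named in the first comment could be computed as below. This is a plain-Python sketch under assumed inputs: per-question boolean flags for precision, recall, and F1, and binary masks given as nested lists for IoU. None of these data structures come from the paper.

```python
from typing import Sequence, Tuple


def precision_recall_f1(pred: Sequence[bool], truth: Sequence[bool]) -> Tuple[float, float, float]:
    """Score detector flags (pred) against human labels (truth), item by item."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def mask_iou(pred_mask: Sequence[Sequence[int]], truth_mask: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks with the same shape."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred_mask, truth_mask):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0  # two empty masks agree perfectly
```

Reporting both kinds of score would separate two questions: whether the right entities are flagged, and whether the refined region actually covers them.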

minor comments (2)

[abstract] The abstract states 'substantial improvements' but provides no numerical values, error bars, or baseline details; these should be summarized with key metrics in the abstract for clarity.
[localized refinement] Notation for region segmentation across frames and the joint optimization objective in the localized refinement stage could be formalized with equations to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the validation of our claims.

Referee: Misalignment detection section: No quantitative validation metrics (precision, recall, IoU, or F1) are reported for the MLLM-based detector against human-annotated ground truth on fine-grained errors (objects, attributes, relations). This is load-bearing for the central claim, as improvements on the benchmarks cannot be confidently attributed to precise localized correction without evidence that detection errors are rare.

Authors: We acknowledge that the current manuscript lacks direct quantitative metrics (e.g., precision, recall, F1) for the MLLM-based misalignment detector evaluated against human-annotated ground truth on fine-grained errors. The end-to-end gains on EvalCrafter and T2V-CompBench with four backbones, combined with robustness ablations, provide indirect support, but we agree this does not fully isolate detection reliability. We will add a dedicated human evaluation subsection reporting precision, recall, IoU, and F1 on a sampled set of videos with fine-grained annotations. revision: yes
Referee: Experiments section (ablation studies): While comprehensive ablations are mentioned, the contribution of the detection stage versus the planning and refinement stages is not isolated with controlled variants (e.g., random region selection or full regeneration baselines). This leaves open whether the reported gains on EvalCrafter and T2V-CompBench stem specifically from the self-correcting pipeline.

Authors: We agree that controlled variants isolating the detection stage (such as random region selection or full regeneration) are needed to attribute gains specifically to the pipeline. Our existing ablations cover component removals and efficiency, but do not include these exact baselines. We will add these experiments in the revised version, comparing VideoRepair against random-region and full-regeneration controls on the same benchmarks and backbones. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with no derivation chain

The paper presents a training-free, model-agnostic refinement pipeline evaluated empirically on EvalCrafter and T2V-CompBench. No equations, fitted parameters, predictions derived from inputs, or self-citations are described as load-bearing for the central claims. The three stages (misalignment detection via MLLM, refinement planning, localized refinement) are introduced as novel components without reducing to prior fitted values or self-referential definitions. The reported improvements are benchmark results, not outputs forced by construction from the method's own inputs. This is the expected non-finding for an applied empirical method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no free parameters, domain axioms beyond standard machine-learning practice, or new invented entities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiB...

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 1 Pith paper · 11 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 2

[2]

The crystal ball hypothesis in diffusion models: Anticipating object positions from initial noise

Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Boqing Gong, Cho-Jui Hsieh, and Minhao Cheng. The crystal ball hypothesis in diffusion models: Anticipating object positions from initial noise. arXiv preprint arXiv:2406.01970, 2024. 2, 5

[3]

Multidiffusion: Fusing diffusion paths for controlled image generation

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. In Proceedings of the International Conference on Machine Learning (ICML), 2023. 5

[4]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 2

[5]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024. 2, 7

[6]

Segment and track anything

Yangming Cheng, Liulei Li, Yuanyou Xu, Xiaodi Li, Zongxin Yang, Wenguan Wang, and Yi Yang. Segment and track anything. arXiv preprint arXiv:2305.06558, 2023. 6

[7]

Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-image generation

Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, and Su Wang. Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-image generation. In Proceedings of the International Conference on Learning Representations (ICLR), 2024. 2, 3, 7, 12

[8]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146, 2024. 5

[9]

Structure and content-guided video synthesis with diffusion models

Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the International Conference on Computer Vision (ICCV), 2023. 2

[10]

CLIPScore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. 8

[11]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 2

[12]

Video diffusion models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2022. 2

[13]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868,

[14]

Zeroscope, 2023

huggingface. Zeroscope, 2023. 6, 7

[15]

Text2video-zero: Text-to-image diffusion models are zero-shot video generators

Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the International Conference on Computer Vision (ICCV), 2023. 2

[16]

Semantic-sam: Segment and recognize anything at any granularity

Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-sam: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023. 5

[17]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR,

[18]

Selma: Learning and merging skill-specific text-to-image experts with auto-generated data

Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, and Mohit Bansal. Selma: Learning and merging skill-specific text-to-image experts with auto-generated data. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 3

[19]

T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback

Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, and William Yang Wang. T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 2, 7, 13

[20]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 2, 3, 5, 12

[21]

Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models

Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023. 7

[22]

Llm-grounded video diffusion models

Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, and Boyi Li. Llm-grounded video diffusion models. In Proceedings of the International Conference on Learning Representations (ICLR), 2024. 3

[23]

Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning

Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning. arXiv preprint arXiv:2309.15091,

[24]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 7

[25]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 7

[26]

Evalcrafter: Benchmarking and evaluating large video generation models

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 5, 6, 13

[27]

Videofusion: Decomposed diffusion models for high-quality video generation

Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. arXiv preprint arXiv:2303.08320,

[28]

Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning

Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, and Shifeng Chen. Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 3

[29]

Latte: Latent Diffusion Transformer for Video Generation

Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024. 6, 7

[30]

Improving text-to-image consistency via automatic prompt optimization

Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, and Michal Drozdzal. Improving text-to-image consistency via automatic prompt optimization. arXiv preprint arXiv:2403.17804, 2024. 2, 3, 5, 6, 7, 12, 13, 14, 21

[31]

Guided image synthesis via initial image editing in diffusion model

Jiafeng Mao, Xueting Wang, and Kiyoharu Aizawa. Guided image synthesis via initial image editing in diffusion model. In Proceedings of the 31st ACM International Conference on Multimedia, pages 5321–5329, 2023. 2, 5

[32]

Hello gpt-4o, 2024

OpenAI. Hello gpt-4o, 2024. 2, 7

[33]

GPT-4 technical report, 2024

OpenAI. GPT-4 technical report, 2024. 7

[35]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,

[36]

Open-sora-plan, 2023

PKU-Yuan Lab and Tuzhan AI etc. Open-sora-plan, 2023. 6, 7

[37]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. 6
[38]

Not all noises are created equally: Diffusion noise selection and optimization

Zipeng Qi, Lichen Bai, Haoyi Xiong, et al. Not all noises are created equally: Diffusion noise selection and optimization. arXiv preprint arXiv:2407.14041, 2024. 2, 5

[39]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 6, 13
[40]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2
[41]

Improved techniques for training gans

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016. 6, 13

[42]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792,

[43]

Consistency Models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023. 2

[44]

Dreamsync: Aligning text-to-image generation with image understanding feedback

Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, et al. Dreamsync: Aligning text-to-image generation with image understanding feedback. In Synthetic Data for Computer Vision Workshop @ CVPR 2024,
[45]

T2v-compbench: A comprehensive benchmark for compositional text-to-video generation

Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. arXiv preprint arXiv:2407.14505, 2024. 2, 6, 13
[46]

Spatial-aware latent initialization for controllable image generation

Wenqiang Sun, Teng Li, Zehong Lin, and Jun Zhang. Spatial-aware latent initialization for controllable image generation. arXiv preprint arXiv:2401.16157, 2024. 2, 5
[47]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020. 6, 13

[48]

Videotetris: Towards compositional text-to-video generation

Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, et al. Videotetris: Towards compositional text-to-video generation. In Advances in Neural Information Processing Systems (NeurIPS), 2024. 6, 7

[49]

ModelScope Text-to-Video Technical Report

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023. 2, 6, 7

[50]

Videomae v2: Scaling video masked autoencoders with dual masking

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14549–14560, 2023. 6, 13
[51]

Videolcm: Video latent consistency model, 2023

Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, and Nong Sang. Videolcm: Video latent consistency model, 2023. 2

[52]

A recipe for scaling up text-to-video generation with text-free videos

Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, and Nong Sang. A recipe for scaling up text-to-video generation with text-free videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6572–6582, 2024. 2
[53]

Exploring video quality assessment on user generated contents from aesthetic and technical perspectives

Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20144–20154, 2023. 6, 13
[54]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the International Conference on Computer Vision (ICCV), 2023. 2

[55]

Self-correcting llm-controlled diffusion models

Tsung-Han Wu, Long Lian, Joseph E Gonzalez, Boyi Li, and Trevor Darrell. Self-correcting llm-controlled diffusion models. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 3, 4, 6, 7, 12, 13, 14

[56]

Compositional video generation as flow equalization

Xingyi Yang and Xinchao Wang. Compositional video generation as flow equalization. arXiv preprint arXiv:2407.06182,
[57]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 2, 7
[58]

Show-1: Marrying pixel and latent diffusion models for text-to-video generation

David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. International Journal of Computer Vision, pages 1–15, 2024. 6, 7

[59]

Controlvideo: Training-free controllable text-to-video generation

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023. 2

Appendix

A. VideoRepair Implementation Details
A.1. Question Generation
A.2. Visual Question Answering
A....
[60]

For example, video ranking is not applied when K = 1, and only one refinement is produced using a single random seed noise ϵ′
[61]

For ranking metrics, we rely on DSG Obj across all ablation studies. As depicted in Fig. 8, higher K values (5, 10, and 20) consistently yield higher scores across all categories than K = 1. This trend is particularly prominent in the ‘count’ category, where increasing K leads to noticeable performance improvements, highlighting the importance of consi...
[62]

Given the question: "{cur_question}", provide a brief reasoning (up to two sentences) to determine the accurate answer
[63]

Respond to the question using binary values: 1.0 for "Yes" and 0.0 for "No". If the answer is uncertain or unnatural due to image distortion or other issues, respond with 0.0 ("No")
[64]

Return the number of "{key_objects}" (as an integer) mentioned in the initial prompt "{cur_question}"
[65]

Return the number of "{key_objects}" (as an integer) in the provided image. Return the result as a dictionary in the following format (not in JSON format): {{"Q": "<question>", "A": <binary answer>, "reasoning": "<brief reasoning>", "obj_in_prompt": <number of key object mentioned in the initial prompt>, "obj_in_img": <number of key object in the image>}}...

