Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement
Pith reviewed 2026-05-23 08:23 UTC · model grok-4.3
The pith
VideoRepair detects fine-grained text-video misalignments and performs targeted local refinements while preserving correct regions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that a three-stage process—MLLM-driven misalignment detection with automatically generated questions, refinement planning that segments and preserves correct entities across frames, and joint optimization for localized regeneration—enables self-correction of text-to-video outputs. This process is model-agnostic and training-free, and experiments on EvalCrafter and T2V-CompBench demonstrate substantial gains in alignment metrics over recent baselines.
What carries the argument
A region-preserving refinement strategy built from three parts: MLLM-based misalignment detection, refinement planning, and localized refinement.
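The three stages can be pictured as a small driver loop. The sketch below is illustrative only: `repair`, `Plan`, and the four callables are hypothetical names, not the paper's API, and the real stages (question generation, MLLM answering, segmentation, diffusion refinement) are passed in as stand-in functions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Plan:
    keep_entities: List[str]  # entities judged correct; their regions are preserved
    refine_prompt: str        # targeted prompt for the misaligned remainder


def repair(
    prompt: str,
    video: object,
    ask_questions: Callable[[str], List[str]],
    answer: Callable[[object, str], bool],
    make_plan: Callable[[str, List[str]], Plan],
    refine: Callable[[object, Plan], object],
) -> object:
    """Hypothetical driver: detect misalignments, plan, then refine locally."""
    questions = ask_questions(prompt)                        # stage (i): question generation
    failed = [q for q in questions if not answer(video, q)]  # stage (i): yes/no answers
    if not failed:
        return video                                         # nothing flagged: keep as is
    plan = make_plan(prompt, failed)                         # stage (ii): refinement planning
    return refine(video, plan)                               # stage (iii): localized refinement
```

Passing the stages in as callables mirrors the model-agnostic claim: under this assumption, swapping the T2V backbone only changes what `refine` does.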
If this is right
- The method improves alignment metrics across diverse prompts on EvalCrafter and T2V-CompBench.
- It works without retraining when applied to four different recent T2V diffusion backbones.
- It reduces unnecessary changes by keeping correctly generated regions intact during refinement.
- Ablations confirm the framework remains efficient and produces interpretable correction steps.
Where Pith is reading between the lines
- The same detection-plus-local-fix pattern could be tested on text-to-image models that also struggle with complex attribute binding.
- If the MLLM detector generalizes, it might serve as an automatic quality filter before any refinement step.
- The preservation of correct regions suggests a route to lower inference cost compared with full video regeneration.
Load-bearing premise
The MLLM-based evaluation with automatically generated questions reliably identifies which regions of the video are misaligned with the text prompt.
What would settle it
Running the detection stage on a set of videos with known, human-verified misalignments and finding that the MLLM flags the actual mismatched regions no more often than random selection would.
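A minimal sketch of that comparison, assuming each video is reduced to a set of candidate region labels. `detection_recall` and `chance_recall` are hypothetical helpers, not part of the paper. The chance baseline is the expected recall of picking the same number of regions uniformly at random without replacement.

```python
from typing import Set


def detection_recall(flagged: Set[str], truth: Set[str]) -> float:
    """Fraction of human-verified misaligned regions that the detector flagged."""
    if not truth:
        return 1.0  # nothing to find, so nothing was missed
    return len(flagged & truth) / len(truth)


def chance_recall(num_flagged: int, num_regions: int) -> float:
    """Expected recall when num_flagged of num_regions are picked uniformly at random.

    Each true region is selected with probability num_flagged / num_regions,
    so the expected recall equals that ratio regardless of how many are true.
    """
    if num_regions <= 0:
        raise ValueError("num_regions must be positive")
    return num_flagged / num_regions
```

A detector whose recall sits near `chance_recall` over many annotated videos would undercut the load-bearing premise above.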
Original abstract
Recent text-to-video (T2V) diffusion models have made remarkable progress in generating high-quality videos. However, they often struggle to align with complex text prompts, particularly when multiple objects, attributes, or spatial relations are specified. We introduce VideoRepair, the first self-correcting, training-free, and model-agnostic video refinement framework that automatically detects fine-grained text-video misalignments and performs targeted, localized corrections. Our key insight is that even misaligned videos usually contain correctly generated regions that should be preserved rather than regenerated. Building on this observation, VideoRepair proposes a novel region-preserving refinement strategy with three stages: (i) misalignment detection, where MLLM-based evaluation with automatically generated evaluation questions identifies misaligned regions; (ii) refinement planning, which preserves correctly generated entities, segments their regions across frames, and constructs targeted prompts for misaligned areas; and (iii) localized refinement, which selectively regenerates problematic regions while preserving faithful content through joint optimization of preserved and newly generated areas. On two benchmarks, EvalCrafter and T2V-CompBench with four recent T2V backbones, VideoRepair achieves substantial improvements over recent baselines across diverse alignment metrics. Comprehensive ablations further demonstrate the efficiency, robustness, and interpretability of our framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VideoRepair, a training-free and model-agnostic framework for refining text-to-video (T2V) outputs. It consists of three stages: (i) MLLM-based misalignment detection via automatically generated questions to identify misaligned regions, (ii) refinement planning that preserves correct entities and segments regions, and (iii) localized refinement that regenerates only problematic areas through joint optimization. The central claim is that this yields substantial improvements over baselines on EvalCrafter and T2V-CompBench across four recent T2V backbones, supported by ablations on efficiency and robustness.
Significance. If the misalignment detection proves reliable and the gains are attributable to targeted preservation rather than generic resampling, the work could meaningfully advance training-free post-processing for T2V alignment, particularly for complex multi-object prompts. The model-agnostic design and emphasis on region preservation are practical strengths.
major comments (2)
- [misalignment detection] Misalignment detection section: No quantitative validation metrics (precision, recall, IoU, or F1) are reported for the MLLM-based detector against human-annotated ground truth on fine-grained errors (objects, attributes, relations). This is load-bearing for the central claim, as improvements on the benchmarks cannot be confidently attributed to precise localized correction without evidence that detection errors are rare.
- [experiments / ablations] Experiments section (ablation studies): While comprehensive ablations are mentioned, the contribution of the detection stage versus the planning and refinement stages is not isolated with controlled variants (e.g., random region selection or full regeneration baselines). This leaves open whether the reported gains on EvalCrafter and T2V-CompBench stem specifically from the self-correcting pipeline.
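The metrics requested in the first major comment are standard and can be computed as follows. This is a sketch under stated assumptions: detections and ground truth are compared as sets of entity labels, masks are represented as sets of `(row, col)` pixels, and the function names are illustrative.

```python
from typing import Set, Tuple


def precision_recall_f1(flagged: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of flagged items against annotated ground truth."""
    tp = len(flagged & truth)  # true positives: flagged and actually misaligned
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mask_iou(pred: Set[Tuple[int, int]], gold: Set[Tuple[int, int]]) -> float:
    """Intersection over union of two pixel sets; two empty masks count as a match."""
    union = pred | gold
    return len(pred & gold) / len(union) if union else 1.0
```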
minor comments (2)
- [abstract] The abstract states 'substantial improvements' but provides no numerical values, error bars, or baseline details; these should be summarized with key metrics in the abstract for clarity.
- [localized refinement] Notation for region segmentation across frames and the joint optimization objective in the localized refinement stage could be formalized with equations to improve reproducibility.
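The joint optimization objective is not specified on this page. One common way to combine preserved and regenerated content, assumed here purely for illustration and not taken from the paper, is a per-cell weighted blend driven by a keep mask.

```python
from typing import List

Grid = List[List[float]]


def blend_regions(preserved: Grid, generated: Grid, keep: Grid) -> Grid:
    """Per-cell mix: keep=1.0 copies the preserved value, keep=0.0 the new one.

    All three grids are assumed to have the same shape.
    """
    return [
        [k * p + (1.0 - k) * g for p, g, k in zip(p_row, g_row, k_row)]
        for p_row, g_row, k_row in zip(preserved, generated, keep)
    ]
```

Writing the actual objective in a comparably explicit form is the kind of formalization the comment asks for.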
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the validation of our claims.
- Referee: Misalignment detection section: No quantitative validation metrics (precision, recall, IoU, or F1) are reported for the MLLM-based detector against human-annotated ground truth on fine-grained errors (objects, attributes, relations). This is load-bearing for the central claim, as improvements on the benchmarks cannot be confidently attributed to precise localized correction without evidence that detection errors are rare.
  Authors: We acknowledge that the current manuscript lacks direct quantitative metrics (e.g., precision, recall, F1) for the MLLM-based misalignment detector evaluated against human-annotated ground truth on fine-grained errors. The end-to-end gains on EvalCrafter and T2V-CompBench with four backbones, combined with robustness ablations, provide indirect support, but we agree this does not fully isolate detection reliability. We will add a dedicated human evaluation subsection reporting precision, recall, IoU, and F1 on a sampled set of videos with fine-grained annotations.
  Revision: yes
- Referee: Experiments section (ablation studies): While comprehensive ablations are mentioned, the contribution of the detection stage versus the planning and refinement stages is not isolated with controlled variants (e.g., random region selection or full regeneration baselines). This leaves open whether the reported gains on EvalCrafter and T2V-CompBench stem specifically from the self-correcting pipeline.
  Authors: We agree that controlled variants isolating the detection stage (such as random region selection or full regeneration) are needed to attribute gains specifically to the pipeline. Our existing ablations cover component removals and efficiency, but do not include these exact baselines. We will add these experiments in the revised version, comparing VideoRepair against random-region and full-regeneration controls on the same benchmarks and backbones.
  Revision: yes
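A random-region control is most informative when it regenerates the same amount of area as the detector would. The sketch below shows one way to build such a control, assuming masks are 2D grids of 0/1 values. `random_mask_like` is a hypothetical helper, and scattering individual cells is a simplification: a real control would more plausibly sample contiguous regions.

```python
import random
from typing import List

Mask = List[List[int]]


def random_mask_like(mask: Mask, seed: int = 0) -> Mask:
    """Return a mask of the same shape and selected area, with cells chosen at random."""
    height, width = len(mask), len(mask[0])
    area = sum(1 for row in mask for v in row if v)  # number of selected cells to match
    rng = random.Random(seed)                        # seeded for reproducible controls
    chosen = set(rng.sample(range(height * width), area))
    return [
        [1 if r * width + c in chosen else 0 for c in range(width)]
        for r in range(height)
    ]
```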
Circularity Check
No circularity: empirical framework with no derivation chain
The paper presents a training-free, model-agnostic refinement pipeline evaluated empirically on EvalCrafter and T2V-CompBench. No equations, fitted parameters, predictions derived from inputs, or self-citations are described as load-bearing for the central claims. The three stages (misalignment detection via MLLM, refinement planning, localized refinement) are introduced as novel components without reducing to prior fitted values or self-referential definitions. The reported improvements are benchmark results, not outputs forced by construction from the method's own inputs. This is the expected non-finding for an applied empirical method paper.
Forward citations
Cited by 1 Pith paper
- MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiB...
Appendix excerpt (ablation on the number of refinement candidates K)
For example, video ranking is not applied when K = 1, and only one refinement is produced using a single random seed noise ϵ′.
For ranking metrics, we rely on DSG Obj across all ablation studies. As depicted in Fig. 8, higher K values (5, 10, and 20) consistently yield higher scores across all categories than K = 1. This trend is particularly prominent in the ‘count’ category, where increasing K leads to noticeable performance improvements, highlighting the importance of consi...
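The excerpt describes a best-of-K selection: several refinements are produced from different seeds and ranked by a score, with ranking skipped when K = 1. A minimal sketch, where `best_of_k` is a hypothetical helper and `score` stands in for the DSG-based ranking metric:

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_k(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Pick the highest-scoring candidate; with a single candidate, skip scoring."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]          # K = 1: no ranking is applied
    return max(candidates, key=score)  # ties resolve to the earliest candidate
```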
Appendix excerpt (visual question answering prompt)
- Given the question: "{cur_question}", provide a brief reasoning (up to two sentences) to determine the accurate answer.
- Respond to the question using binary values: 1.0 for "Yes" and 0.0 for "No". If the answer is uncertain or unnatural due to image distortion or other issues, respond with 0.0 ("No").
- Return the number of "{key_objects}" (as an integer) mentioned in the initial prompt "{cur_question}".
- Return the number of "{key_objects}" (as an integer) in the provided image.
Return the result as a dictionary in the following format (not in JSON format): {{"Q": "<question>", "A": <binary answer>, "reasoning": "<brief reasoning>", "obj_in_prompt": <number of key object mentioned in the initial prompt>, "obj_in_img": <number of key object in the image>}}...
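Because the prompt asks for a dictionary literal rather than JSON, a reply in that format can be read with Python's `ast.literal_eval`. The sketch below assumes the reply contains exactly one such dictionary, possibly surrounded by extra text. `parse_vqa_reply` and `summarize` are hypothetical helpers, and treating a count mismatch as a separate signal is an assumption, not something the excerpt states.

```python
import ast
from typing import Any, Dict, Tuple


def parse_vqa_reply(reply: str) -> Dict[str, Any]:
    """Extract the dictionary literal from a model reply in the prompted format."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no dictionary found in reply")
    data = ast.literal_eval(reply[start : end + 1])  # safe: evaluates literals only
    if not isinstance(data, dict):
        raise ValueError("reply did not contain a dictionary")
    return data


def summarize(data: Dict[str, Any]) -> Tuple[bool, bool]:
    """Return (answer_is_yes, counts_match) from a parsed reply."""
    answer_is_yes = float(data["A"]) == 1.0
    counts_match = data["obj_in_prompt"] == data["obj_in_img"]
    return answer_is_yes, counts_match
```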