Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
Pith reviewed 2026-05-17 19:50 UTC · model grok-4.3
The pith
Fine-tuning text-to-video models on sparse synthetic data yields superior camera control compared to photorealistic real data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuning large-scale text-to-video diffusion models on sparse, low-quality synthetic data enables controls over physical camera parameters and yields superior results to fine-tuning on photorealistic real data.
What carries the argument
The data-efficient fine-tuning strategy that learns camera controls from sparse low-quality synthetic data, supported by a framework justifying the results intuitively and quantitatively.
If this is right
- Adaptation to new generative controls requires far less data collection effort.
- Generated videos gain precise control over parameters such as shutter speed and aperture.
- The need for expensive high-fidelity real video datasets is reduced for model customization.
- A quantitative framework now exists to predict when simpler data will outperform complex data for control learning.
Where Pith is reading between the lines
- The same synthetic-data approach may extend to other physical controls such as lighting or motion dynamics.
- Resource-limited settings could adopt this method to customize large video models without massive real datasets.
- Testing the boundary of how sparse or low-quality the synthetic data can be while retaining gains would be a direct next experiment.
Load-bearing premise
The synthetic data sufficiently captures physical camera parameters without introducing biases that would prevent effective transfer to real video generation.
What would settle it
A side-by-side evaluation where models fine-tuned on real photorealistic data achieve higher control accuracy or better visual quality on real test videos than those fine-tuned on synthetic data would falsify the superiority claim.
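A minimal Python sketch of how such a side-by-side comparison could be scored is given below. It assumes each fine-tuned model is run over several random seeds and that some estimator (hypothetical here) recovers a camera parameter, such as shutter speed, from each generated clip. The function names and the neutral condition labels are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean, stdev


def control_error(requested, estimated):
    """Mean absolute error between requested and estimated parameter values."""
    if len(requested) != len(estimated):
        raise ValueError("requested and estimated must have the same length")
    return mean(abs(r - e) for r, e in zip(requested, estimated))


def summarize(per_seed_errors):
    """Return (mean, sample standard deviation) of the error across seeds."""
    return mean(per_seed_errors), stdev(per_seed_errors)


def lower_error_condition(errors_by_condition):
    """Return the label of the condition with the lowest mean error.

    errors_by_condition maps a label (e.g. "model_a") to a list of
    per-seed control errors. Ties resolve to the first label encountered.
    """
    return min(errors_by_condition, key=lambda k: mean(errors_by_condition[k]))
```

A lower mean error for the real-data condition on real test videos, with a gap larger than the seed-to-seed spread, would count against the superiority claim. The spread from `summarize` matters because a difference smaller than the standard deviation across seeds would not settle the question either way.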
Original abstract
Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a data-efficient fine-tuning strategy for adding controls over physical camera parameters (e.g., shutter speed, aperture) to large text-to-video diffusion models. It claims that fine-tuning on sparse, low-quality synthetic data not only enables the desired controls but yields superior results to fine-tuning on photorealistic real data, and supplies an intuitive and quantitative framework to explain the phenomenon.
Significance. If the central claim is substantiated, the result would be significant: it would demonstrate that low-fidelity synthetic renders can be more effective than high-fidelity real footage for learning specific physical controls, substantially lowering the cost of controllable video generation and challenging the prevailing assumption that data realism is always preferable.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the superiority claim over real-data fine-tuning is stated without any quantitative metrics, baselines, or error bars in the abstract and is only weakly supported in the reported experiments; direct head-to-head numbers on real-video control accuracy (e.g., parameter regression error or perceptual metrics) are required to make the claim load-bearing.
- [Framework section] Framework section (likely §3 or §5): the quantitative justification for why synthetic data avoids harmful biases rests on the assumption that renderer-specific artifacts (uniform blur kernels, perfect depth edges) do not dominate the learned mapping; no explicit domain-gap metric, ablation on lighting/sensor noise, or cross-domain control accuracy is provided, which is central to the transfer argument.
- [§4.3] §4.3 (Real-video evaluation): transfer performance on real footage is assessed only qualitatively; without quantitative results on held-out real videos with known camera parameters, the superiority claim cannot be distinguished from possible exploitation of synthetic artifacts.
minor comments (2)
- [§2] Notation in §2: the mapping from rendered shutter/aperture values to diffusion conditioning vectors should be written explicitly (e.g., as an equation) rather than described only in prose.
- [Figure 3] Figure 3: side-by-side qualitative comparisons would benefit from explicit parameter-value annotations on each column to make the control effect immediately visible.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important areas for strengthening the quantitative support of our claims. We address each major comment below and indicate the revisions planned for the next manuscript version.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Experiments): the superiority claim over real-data fine-tuning is stated without any quantitative metrics, baselines, or error bars in the abstract and is only weakly supported in the reported experiments; direct head-to-head numbers on real-video control accuracy (e.g., parameter regression error or perceptual metrics) are required to make the claim load-bearing.
  Authors: We agree that the abstract and experimental section would benefit from explicit quantitative comparisons. In the revised manuscript we have updated the abstract to reference key metrics (e.g., lower parameter regression error and higher perceptual scores) and added a new table in §4 with head-to-head results, including means and standard deviations across multiple random seeds for both synthetic and real-data fine-tuning. These additions directly support the superiority claim with numerical evidence.
  Revision: yes
- Referee: [Framework section] Framework section (likely §3 or §5): the quantitative justification for why synthetic data avoids harmful biases rests on the assumption that renderer-specific artifacts (uniform blur kernels, perfect depth edges) do not dominate the learned mapping; no explicit domain-gap metric, ablation on lighting/sensor noise, or cross-domain control accuracy is provided, which is central to the transfer argument.
  Authors: We acknowledge that further analysis of potential renderer artifacts would strengthen the framework. The revised manuscript includes a new ablation subsection that measures domain gap via Fréchet distance on CLIP features between synthetic renders and real footage, plus controlled experiments injecting sensor noise and varied lighting into the synthetic data. Cross-domain control accuracy is also reported, showing that performance gains persist even when artifacts are deliberately introduced, supporting that the benefit arises from reduced bias rather than artifact exploitation.
  Revision: yes
- Referee: [§4.3] §4.3 (Real-video evaluation): transfer performance on real footage is assessed only qualitatively; without quantitative results on held-out real videos with known camera parameters, the superiority claim cannot be distinguished from possible exploitation of synthetic artifacts.
  Authors: We agree that purely qualitative real-video results leave room for alternative interpretations. Because large-scale real videos with precise ground-truth camera parameters are not publicly available, we have added a proxy quantitative evaluation using parameter estimates from a pre-trained camera-parameter regressor on held-out real clips, together with a small-scale user study measuring perceived control accuracy. These additions, combined with the framework analysis, help separate the effect of synthetic data from potential artifact exploitation.
  Revision: partial
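The domain-gap measurement mentioned in the second response (a Fréchet distance between feature sets) can be illustrated with a small Python sketch. It assumes the features, for example CLIP embeddings, have already been extracted elsewhere and are given as plain lists of floats. To stay dependency-free it fits Gaussians with diagonal covariance, which is a simplification swapped in for illustration and not necessarily what the authors compute; the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def column_stats(features):
    """Per-dimension mean and sample variance of equal-length feature vectors."""
    columns = list(zip(*features))
    return [mean(c) for c in columns], [variance(c) for c in columns]


def frechet_distance_diag(features_a, features_b):
    """Fréchet distance (FID-style) between diagonal Gaussians fitted to two sets.

    General form: ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)).
    With diagonal covariances the trace term reduces to the sum over
    dimensions of (sd_a - sd_b)^2, which is what is computed here.
    Each input needs at least two feature vectors for the sample variance.
    """
    mu_a, var_a = column_stats(features_a)
    mu_b, var_b = column_stats(features_b)
    if len(mu_a) != len(mu_b):
        raise ValueError("feature sets must have the same dimensionality")
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(var_a, var_b))
    return mean_term + cov_term
```

Calling `frechet_distance_diag(synthetic_features, real_features)` yields zero for identical feature sets and grows as the means or per-dimension spreads diverge. A full-covariance version would need a matrix square root and would also capture correlations between feature dimensions, which this sketch ignores.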
Circularity Check
No circularity: framework claim is empirical and self-contained
The paper states that fine-tuning on sparse synthetic data enables controls and yields superior results, then claims to provide a framework justifying this 'both intuitively and quantitatively.' No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described framework. The central result is presented as an empirical observation supported by the framework rather than a mathematical reduction to its own inputs. No load-bearing step reduces by construction to a fit or prior self-citation; the derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: synthetic data can encode physical camera parameters sufficiently for transfer to real video generation.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic 'real' data
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: Distributional Drift Rate (Vdrift) as the rate of change in our FEP metrics
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.