VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
Pith reviewed 2026-05-25 04:55 UTC · model grok-4.3
The pith
VDE accelerates rectified flow models by decomposing the velocity into components parallel and orthogonal to the input and estimating those components without any training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VDE decomposes the model's velocity into components parallel and orthogonal to the input, exploits their temporal predictability and directional stability for precise input-adaptive estimation without training, and inserts periodic full forward passes to anchor the state and prevent error accumulation, enabling faster inference in rectified flow models while preserving output fidelity.
What carries the argument
Decomposition of velocity into input-parallel and input-orthogonal components, estimated from their temporal predictability and directional stability with periodic anchoring via full passes.
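The split into input-parallel and input-orthogonal parts can be illustrated with a standard vector projection. The following is a minimal sketch, not the paper's implementation: the function name `decompose_velocity` is hypothetical, and treating the whole tensor as one flattened vector (rather than projecting per sample or per token) is an assumption, since this page does not state the granularity.

```python
import numpy as np

def decompose_velocity(v, x, eps=1e-12):
    """Split v into a part parallel to x and a part orthogonal to x.

    Assumption: v and x are flattened and treated as single vectors.
    """
    v_flat = v.ravel()
    x_flat = x.ravel()
    # Scalar projection coefficient <v, x> / <x, x>.
    coeff = np.dot(v_flat, x_flat) / (np.dot(x_flat, x_flat) + eps)
    v_par = coeff * x      # component along the input direction
    v_perp = v - v_par     # remainder, orthogonal to the input
    return coeff, v_par, v_perp

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # stand-in for the current latent
v = rng.standard_normal((4, 8))   # stand-in for the model's velocity
coeff, v_par, v_perp = decompose_velocity(v, x)
```

By construction the two parts sum back to `v`, and `v_perp` has zero inner product with `x` up to floating-point error.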
If this is right
- Rectified flow models gain a multi-fold inference speedup (3.22× is reported for Flux) on image and video tasks, with only minor quality degradation as measured by LPIPS.
- No model retraining or fine-tuning is required to apply the acceleration.
- The approach outperforms feature-caching baselines by reducing the mismatch between cached states and evolving inputs.
- Periodic full passes limit error growth, allowing longer acceleration sequences without collapse in fidelity.
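The interplay between estimated steps and periodic full passes can be sketched as an Euler sampling loop. Everything here is illustrative: `toy_velocity` is a hypothetical stand-in for the expensive model, the fixed `anchor_interval` schedule is an assumption, and the estimator (reuse the last anchor's parallel coefficient and orthogonal component, recombined with the current input) is a placeholder, not the estimator used in the paper.

```python
import numpy as np

def decompose(v, x, eps=1e-12):
    # Projection coefficient onto x, plus the orthogonal remainder.
    coeff = np.vdot(x, v) / (np.vdot(x, x) + eps)
    return coeff, v - coeff * x

def sample(velocity_fn, x0, num_steps=12, anchor_interval=3):
    """Euler sampler that calls velocity_fn only every anchor_interval steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    full_passes = 0
    coeff, v_perp = 0.0, np.zeros_like(x)
    for k in range(num_steps):
        t = k * dt
        if k % anchor_interval == 0:
            # Full forward pass: re-anchor the state and refresh both components.
            v = velocity_fn(x, t)
            coeff, v_perp = decompose(v, x)
            full_passes += 1
        else:
            # Placeholder estimate (assumption): the parallel part follows the
            # current input, the orthogonal part is held from the last anchor.
            v = coeff * x + v_perp
        x = x + dt * v
    return x, full_passes

target = np.ones(16)

def toy_velocity(x, t):
    # Straight-line flow toward a fixed target; stands in for the real model.
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)
x_ref, n_ref = sample(toy_velocity, x0, anchor_interval=1)    # every step is a full pass
x_fast, n_fast = sample(toy_velocity, x0, anchor_interval=3)  # one full pass in three
```

Comparing `x_fast` against `x_ref` for several intervals gives a rough picture of how quickly error grows between anchors in this toy setting.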
Where Pith is reading between the lines
- The same decomposition principle could be tested on non-rectified diffusion models to check if parallel-orthogonal stability generalizes.
- Measuring how the angle between velocity and input evolves across steps might yield a diagnostic for when estimation remains reliable.
- Applying VDE to conditional generation with text or layout inputs could reveal whether the stability holds under stronger guidance.
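The angle diagnostic suggested above could be computed as follows. This is a sketch under assumptions: `cosine_to_input` and `angle_trace` are hypothetical helper names, and the velocities and inputs are assumed to have been recorded from full forward passes at each step.

```python
import numpy as np

def cosine_to_input(v, x, eps=1e-12):
    """Cosine of the angle between a velocity and its input, both flattened."""
    num = np.vdot(x, v)
    den = np.linalg.norm(v) * np.linalg.norm(x) + eps
    return float(num / den)

def angle_trace(velocities, inputs):
    """Per-step angle in degrees, plus the absolute step-to-step change."""
    cosines = np.array([cosine_to_input(v, x) for v, x in zip(velocities, inputs)])
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return angles, np.abs(np.diff(angles))

# Two hand-made steps: velocity aligned with the input, then orthogonal to it.
inputs = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
velocities = [np.array([3.0, 0.0]), np.array([0.0, 2.0])]
angles, changes = angle_trace(velocities, inputs)
```

Small step-to-step changes would be consistent with directional stability; large jumps would mark steps where estimating from earlier steps is risky.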
Load-bearing premise
The velocity components parallel and orthogonal to the input possess temporal predictability and directional stability that permit accurate estimation from prior steps without training.
What would settle it
Compare the true velocity vectors computed by the full model against VDE's estimated vectors at multiple denoising steps on the same input sequence; large, consistent deviations would contradict the premise and should coincide with visible quality loss.
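One way to run that comparison is to log two numbers per step: a relative L2 error and a cosine similarity between the true and estimated velocities. The sketch below assumes both velocities are available as arrays for the same step; the name `velocity_errors` is hypothetical and the metric choice is an assumption rather than the paper's protocol.

```python
import numpy as np

def velocity_errors(v_true, v_est, eps=1e-12):
    """Relative L2 error and cosine similarity between two velocity tensors."""
    norm_true = np.linalg.norm(v_true)
    norm_est = np.linalg.norm(v_est)
    rel_l2 = np.linalg.norm(v_est - v_true) / (norm_true + eps)
    cos = np.vdot(v_true, v_est) / (norm_true * norm_est + eps)
    return float(rel_l2), float(cos)

rng = np.random.default_rng(0)
v_true = rng.standard_normal((4, 8))

# A perfect estimate, and one with the right direction but twice the magnitude.
rel_same, cos_same = velocity_errors(v_true, v_true.copy())
rel_scaled, cos_scaled = velocity_errors(v_true, 2.0 * v_true)
```

Reporting both numbers separates magnitude errors from direction errors, which matters here because the method's premise concerns directional stability specifically.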
Original abstract
Though rectified flow models have achieved remarkable performance in image, video, and 3D generation, their practical deployments are challenged by slow inference speeds. Prior acceleration methods reuse cached features from previous steps, which neglects the growing mismatch between static caches and the evolving input, leading to reduced output fidelity. This work proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating. Specifically, VDE decomposes the model's velocity into components parallel and orthogonal to the input, exploiting their temporal predictability and directional stability for precise, input-adaptive estimation. To prevent error accumulation, it periodically anchors the model's state via full forward passes. Extensive experiments on image and video generation tasks demonstrate that VDE achieves substantial acceleration with minimal loss in visual quality. Notably, VDE accelerates Flux by 3.22 times and achieves an LPIPS of 0.069 on Qwen-Image, outperforming the best baseline with a 52.2% reduction.
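As a reading aid for the last sentence of the abstract: if the 52.2% figure is taken as a relative reduction in LPIPS on Qwen-Image (an interpretation, since the abstract does not spell out the metric or the baseline value), the implied LPIPS of the best baseline follows from simple arithmetic. The value below is derived, not reported.

```latex
\mathrm{LPIPS}_{\text{baseline}}
  \approx \frac{0.069}{1 - 0.522}
  = \frac{0.069}{0.478}
  \approx 0.144
```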
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration technique for rectified flow models. It decomposes the velocity into components parallel and orthogonal to the input x, estimates future values from the claimed temporal predictability and directional stability of those components, and periodically performs full forward passes to reset the state and limit error accumulation. Experiments on image and video tasks report a 3.22× speedup on Flux and an LPIPS of 0.069 on Qwen-Image, described as a 52.2% reduction relative to the best baseline.
Significance. If the predictability and stability properties of the decomposed velocity components can be rigorously established, the method would offer a practical training-free route to faster inference in large rectified-flow models while preserving output fidelity, which is valuable for deployment in image and video generation pipelines.
major comments (2)
- [Abstract / method description] The central claim that the parallel and orthogonal velocity components possess 'temporal predictability and directional stability' permitting accurate input-adaptive estimation is asserted without derivation, bound, or analysis showing why these properties hold across denoising steps or different rectified-flow models. This assumption is load-bearing for the training-free estimation step and the error-control argument.
- [Experiments] No ablation studies, error analysis, or protocol details are supplied to isolate whether the reported 3.22× speedup on Flux and LPIPS=0.069 on Qwen-Image arise from the decomposition/estimation or from the periodic full passes and other implementation choices. This prevents verification that the quantitative gains support the core claim.
minor comments (2)
- [Method] Notation for the parallel/orthogonal decomposition (v_∥ and v_⊥) should be introduced with an explicit equation early in the method section for clarity.
- [Abstract / Experiments] The abstract mentions 'extensive experiments on image and video generation tasks' but does not list the exact models, datasets, or metrics used beyond the two highlighted examples; a table summarizing all evaluated settings would improve reproducibility.
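For the notation requested in the first minor comment, one standard way to write the decomposition is the orthogonal projection of the velocity onto the input. The symbols α_t and the step subscript t are assumptions introduced here for illustration; the paper's own notation may differ.

```latex
\alpha_t = \frac{\langle v_t, x_t \rangle}{\lVert x_t \rVert^2}, \qquad
v_{\parallel,t} = \alpha_t \, x_t, \qquad
v_{\perp,t} = v_t - v_{\parallel,t}, \qquad
\langle v_{\perp,t}, x_t \rangle = 0 .
```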
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the presentation without altering the core claims.
- Referee: [Abstract / method description] The central claim that the parallel and orthogonal velocity components possess 'temporal predictability and directional stability' permitting accurate input-adaptive estimation is asserted without derivation, bound, or analysis showing why these properties hold across denoising steps or different rectified-flow models. This assumption is load-bearing for the training-free estimation step and the error-control argument.
Authors: We agree that the manuscript motivates the properties of temporal predictability and directional stability from the structure of rectified flows but does not supply a formal derivation or bounds. The decomposition follows directly from projecting the velocity onto the input direction and its orthogonal complement, and the stability arises because the parallel component tracks the data manifold while the orthogonal component tracks residual noise. In revision we will add a short analysis subsection with (i) a Lipschitz-based argument for why the components evolve more slowly than the full velocity and (ii) empirical plots of component-wise change across steps and models to support the estimation procedure.
Revision: yes
- Referee: [Experiments] No ablation studies, error analysis, or protocol details are supplied to isolate whether the reported 3.22× speedup on Flux and LPIPS=0.069 on Qwen-Image arise from the decomposition/estimation or from the periodic full passes and other implementation choices. This prevents verification that the quantitative gains support the core claim.
Authors: We concur that the current experiments do not fully isolate the contribution of the decomposition-and-estimation step from the periodic anchoring. In the revised version we will add (i) an ablation that disables the adaptive estimation while retaining anchoring, (ii) per-step error curves (L2 velocity error and LPIPS) with and without anchoring, and (iii) explicit protocol details on anchoring frequency, estimation horizon, and implementation choices. These additions will allow readers to attribute the reported speed-ups and fidelity metrics to the proposed VDE components.
Revision: yes
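The promised protocol details on anchoring frequency matter because the frequency alone caps the achievable speedup. A back-of-the-envelope sketch, assuming a fixed anchoring interval and a hypothetical `est_cost_ratio` (cost of one estimated step relative to one full pass); neither value is given on this page, and the paper's actual schedule may be adaptive.

```python
def count_full_passes(num_steps, anchor_interval):
    """Number of full forward passes when steps 0, k, 2k, ... are anchors."""
    return len(range(0, num_steps, anchor_interval))

def ideal_speedup(num_steps, anchor_interval, est_cost_ratio=0.0):
    """Speedup over running a full pass at every step.

    est_cost_ratio is the assumed cost of an estimated step relative to a
    full pass; 0.0 gives the upper bound where estimation is free.
    """
    n_full = count_full_passes(num_steps, anchor_interval)
    n_est = num_steps - n_full
    return num_steps / (n_full + est_cost_ratio * n_est)

upper_bound = ideal_speedup(12, 3)            # estimation assumed free
with_overhead = ideal_speedup(12, 3, 0.1)     # estimation at 10% of a full pass
passes_50 = count_full_passes(50, 4)
```

An ablation that keeps the anchoring schedule fixed while swapping the estimator would hold this bound constant, so any remaining difference in fidelity could be attributed to the estimation step itself.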
Circularity Check
No circularity: derivation rests on stated velocity properties without reduction to fitted inputs or self-citations.
The paper's core step decomposes the velocity into components parallel and orthogonal to the input and estimates future values from their claimed temporal predictability and directional stability. This is presented as an exploitation of inherent rectified-flow properties rather than of any fitted parameter, self-referential prediction, or load-bearing self-citation. No equation reduces the estimate to its inputs by construction, and the training-free framing avoids any internal fitting loop. The provided text contains no self-citations or ansatz smuggling that would trigger the enumerated patterns.