Recognition: 2 theorem links · Lean Theorem
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Pith reviewed 2026-05-10 18:36 UTC · model grok-4.3
The pith
A training-free method uses attention heads to derive and correct object counts in text-to-video diffusion outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration, yielding higher numerical fidelity in the final video without any model retraining.
What carries the argument
The countable latent layout extracted from selected discriminative attention heads, which supplies the structural signal used to modulate cross-attention and enforce correct object quantities.
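The identify step can be pictured as thresholding the attention map for the counted noun token and counting connected regions. This is a minimal illustrative sketch, not the paper's algorithm: the 0.5 threshold and connected-components labeling stand in for its cluster-based procedure, and `count_layout` is a hypothetical name.

```python
import numpy as np
from scipy import ndimage

def count_layout(attn_map: np.ndarray, thresh: float = 0.5):
    """Derive a countable layout from one spatial attention map.

    attn_map: (H, W) attention for the counted noun token, values >= 0.
    Returns a labeled layout and the number of instances it implies.
    """
    # Binarize relative to the peak response; the 0.5 factor is illustrative.
    mask = attn_map >= thresh * attn_map.max()
    # Connected components stand in for the paper's clustering step.
    labels, n_instances = ndimage.label(mask)
    return labels, n_instances
```

Comparing `n_instances` against the numeral parsed from the prompt is what flags a prompt-layout inconsistency.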
If this is right
- Counting accuracy rises by up to 7.4 percent on the 1.3B model and by 4.9 and 5.5 percent on the 5B and 14B models, respectively.
- CLIP text-video alignment improves while temporal consistency across frames is maintained.
- The structural guidance complements existing techniques such as prompt rewriting and seed sampling.
- The same identify-then-guide pattern works across model scales without retraining.
Where Pith is reading between the lines
- Internal attention maps in diffusion models appear to encode usable layout information that could be applied to other alignment problems such as spatial relations or action ordering.
- Because the method needs no training, it can be stacked with other lightweight post-hoc corrections to handle more complex prompts.
- Further tests on longer videos or prompts with multiple overlapping counts would reveal how far the latent-layout signal generalizes.
Load-bearing premise
Selecting particular attention heads will reliably produce a usable countable layout from the prompt without adding new errors or requiring per-model tuning.
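A toy version of that premise: score each head's attention map by how cleanly it separates into the prompted number of instances, and keep the best head. The count-error-then-contrast criterion below is a guessed stand-in for the paper's separability score, not its actual criterion.

```python
import numpy as np
from scipy import ndimage

def select_head(head_maps: np.ndarray, target_count: int, thresh: float = 0.5) -> int:
    """Pick the head whose map best separates into `target_count` instances.

    head_maps: (num_heads, H, W) per-head spatial attention. The scoring
    rule (count error first, peak contrast as tiebreak) is illustrative.
    """
    best, best_key = 0, None
    for h, m in enumerate(head_maps):
        mask = m >= thresh * m.max()
        _, n = ndimage.label(mask)
        contrast = (m[mask].mean() - m[~mask].mean()
                    if mask.any() and (~mask).any() else 0.0)
        key = (abs(n - target_count), -contrast)  # fewer count errors, then sharper maps
        if best_key is None or key < best_key:
            best, best_key = h, key
    return best
```

If no head scores well under a criterion like this, the premise above fails and the derived layout inherits the error.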
What would settle it
Applying NUMINA to CountBench and finding no gain, or a drop, in object-counting accuracy relative to the unmodified baseline models would show the central claim is incorrect.
Figures
original abstract
Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NUMINA, a training-free identify-then-guide framework for text-to-video diffusion models. It selects discriminative self- and cross-attention heads to derive a countable latent layout from the prompt, conservatively refines this layout, and modulates cross-attention during regeneration to better align generated object counts with textual numerals. On the introduced CountBench, it reports counting accuracy gains of up to 7.4% on Wan2.1-1.3B and 4.9%/5.5% on 5B/14B models, plus improved CLIP alignment with preserved temporal consistency, positioning the method as complementary to seed search and prompt engineering.
Significance. If validated, the work provides a practical, training-free structural guidance technique that leverages internal attention maps for count correction in T2V models, a persistent weakness in current systems. The new CountBench benchmark and public code release are positive contributions for reproducibility and future evaluation. The modest reported gains indicate incremental rather than transformative impact, but the approach could integrate usefully with other inference-time methods if the head-selection step proves robust.
major comments (3)
- [Experiments] Experiments section (results on CountBench): The reported accuracy improvements (7.4% on 1.3B, 4.9% and 5.5% on larger models) are presented without specifying CountBench size, exact counting metric and protocol, baselines (including whether they include prior attention-guidance or prompt-engineering methods), number of seeds per prompt, or any error bars/statistical tests. This information is load-bearing for the central empirical claim.
- [Method] Method section on discriminative head selection: The pipeline's first step assumes that a small subset of self- and cross-attention heads can be automatically identified to yield a reliable 'countable latent layout' that can be thresholded or clustered without high false-positive/negative rates. No ablation is described testing sensitivity to object category, spatial arrangement, prompt complexity, or model scale; if this step errs on even 10-15% of cases, the modest net gains would be erased or reversed.
- [Method] Method section on conservative refinement and cross-attention modulation: The precise algorithms, thresholds, and parameters for layout refinement and attention modulation are not fully specified (e.g., how 'conservative' is operationalized, how temporal consistency is enforced across frames). This makes it impossible to reproduce the guidance step or diagnose why CLIP improves while temporal metrics remain stable.
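For illustration, one way "conservative" could be operationalized is to remove only small, likely spurious components when the derived count overshoots the numeral, and never to fabricate instances when it undershoots. The area threshold and deletion rule below are assumptions, not the paper's algorithm.

```python
import numpy as np
from scipy import ndimage

def refine_layout(labels: np.ndarray, target_count: int, min_area: int = 4) -> np.ndarray:
    """Conservatively prune a labeled layout toward `target_count` instances.

    Only components smaller than `min_area` pixels are ever removed, and
    only while the count exceeds the target; thresholds are illustrative.
    """
    labels = labels.copy()
    n = int(labels.max())
    if n <= target_count:
        return labels  # never invent instances when the count is low
    areas = ndimage.sum(np.ones_like(labels, dtype=float),
                        labels, index=np.arange(1, n + 1))
    # Drop the smallest components first, and only those below min_area.
    for comp in np.argsort(areas) + 1:
        if n <= target_count or areas[comp - 1] >= min_area:
            break
        labels[labels == comp] = 0
        n -= 1
    return labels
```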
minor comments (3)
- [Experiments] The paper should include qualitative failure cases (prompts where NUMINA leaves the count unchanged or worsens it) to illustrate the limits of the head-selection heuristic.
- [Method] Notation for attention heads, latent layouts, and modulation operations should be introduced with explicit equations or pseudocode early in the method section for clarity.
- [Introduction] Related-work discussion could more explicitly contrast NUMINA with prior attention-map guidance techniques in diffusion models (e.g., those using cross-attention for layout control).
Simulated Author's Rebuttal
Thank you for the constructive and detailed review of our manuscript. We address each major comment point by point below. We will revise the manuscript to incorporate the requested clarifications, details, and additional analyses where they strengthen the presentation of our work.
point-by-point responses
-
Referee: [Experiments] Experiments section (results on CountBench): The reported accuracy improvements (7.4% on 1.3B, 4.9% and 5.5% on larger models) are presented without specifying CountBench size, exact counting metric and protocol, baselines (including whether they include prior attention-guidance or prompt-engineering methods), number of seeds per prompt, or any error bars/statistical tests. This information is load-bearing for the central empirical claim.
Authors: We agree that these experimental details are necessary to fully substantiate the central claims. In the revised manuscript we will expand the Experiments section to report the exact size of CountBench, the precise counting metric and evaluation protocol (including how objects are detected and counted in generated videos), a complete list of baselines that explicitly includes prior attention-guidance and prompt-engineering methods, the number of seeds evaluated per prompt, and error bars with appropriate statistical tests. These additions will allow readers to assess the reported gains rigorously. revision: yes
-
Referee: [Method] Method section on discriminative head selection: The pipeline's first step assumes that a small subset of self- and cross-attention heads can be automatically identified to yield a reliable 'countable latent layout' that can be thresholded or clustered without high false-positive/negative rates. No ablation is described testing sensitivity to object category, spatial arrangement, prompt complexity, or model scale; if this step errs on even 10-15% of cases, the modest net gains would be erased or reversed.
Authors: The discriminative head selection procedure is described in Section 3.2, where heads are chosen according to their attention focus on numerically relevant tokens. We acknowledge that dedicated ablations on robustness are absent from the current version. In the revision we will add an ablation study that systematically varies object category, spatial arrangement, prompt complexity, and model scale, reporting the impact on layout quality and final counting accuracy. This will quantify the reliability of the step and show how the subsequent conservative refinement limits error propagation. revision: yes
-
Referee: [Method] Method section on conservative refinement and cross-attention modulation: The precise algorithms, thresholds, and parameters for layout refinement and attention modulation are not fully specified (e.g., how 'conservative' is operationalized, how temporal consistency is enforced across frames). This makes it impossible to reproduce the guidance step or diagnose why CLIP improves while temporal metrics remain stable.
Authors: We will revise the Method section to supply the missing algorithmic details, including the exact thresholds, parameter values, and pseudocode for conservative layout refinement and cross-attention modulation. We will explicitly define the operationalization of 'conservative' refinement and describe the frame-to-frame consistency enforcement mechanism. These clarifications will make the guidance procedure fully reproducible from the text and will help explain the observed CLIP gains alongside stable temporal metrics. The released code already implements these steps; the paper revision will align the description with the implementation. revision: yes
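As a rough picture of what cross-attention modulation might mean in practice: add a log-scale bias toward the counted token inside the refined layout regions before the softmax, and away from it elsewhere. This additive-bias form is a common guidance trick and only a guess at the paper's implementation; `modulate_cross_attention` and `boost` are hypothetical names.

```python
import numpy as np

def modulate_cross_attention(logits: np.ndarray, layout_mask: np.ndarray,
                             token_idx: int, boost: float = 1.5) -> np.ndarray:
    """Bias cross-attention toward the target layout before the softmax.

    logits: (H*W, num_tokens) pre-softmax cross-attention scores;
    layout_mask: flattened boolean (H*W,) marking desired instance regions.
    """
    out = logits.copy()
    out[layout_mask, token_idx] += np.log(boost)   # strengthen inside the layout
    out[~layout_mask, token_idx] -= np.log(boost)  # weaken outside it
    # Softmax over prompt tokens at each spatial location.
    e = np.exp(out - out.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Applied per frame with a shared layout, a scheme like this would also be consistent with the stable temporal metrics the review asks about.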
Circularity Check
No significant circularity; empirical training-free method with independent content
full rationale
The paper presents NUMINA as a training-free identify-then-guide framework that selects discriminative self- and cross-attention heads to derive a countable latent layout, refines it conservatively, and modulates cross-attention for regeneration. No equations, derivations, or fitted parameters are described that reduce the reported accuracy gains (e.g., up to 7.4% on CountBench) to the inputs by construction. The approach is benchmarked empirically on introduced data and multiple model scales while maintaining other metrics, with no load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work. The central claims rest on observable improvements from the pipeline rather than self-referential definitions or renamed known results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Discriminative self- and cross-attention heads encode a countable latent layout consistent with the text prompt
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout... cluster-based algorithm... layout refinement... cross-attention modulation
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
We select the head with the highest instance separability... S(SA_h) = S1 + S2 + γ S3
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Introducing claude sonnet 4.5.https:// www.anthropic.com/news/claude- sonnet- 4- 5, 2025
Anthropic. Introducing claude sonnet 4.5.https:// www.anthropic.com/news/claude- sonnet- 4- 5, 2025. 6
2025
-
[2]
Uniedit: A unified tuning- free framework for video motion and appearance editing
Jianhong Bai, Tianyu He, Yuchi Wang, Junliang Guo, Haoji Hu, Zuozhu Liu, and Jiang Bian. Uniedit: A unified tuning- free framework for video motion and appearance editing. In Proc. of ACM Multimedia, pages 10171–10180, 2025. 2
2025
-
[3]
Lumiere: A space-time diffu- sion model for video generation
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Her- rmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffu- sion model for video generation. InSIGGRAPH Asia Conf., pages 1–11, 2024. 1
2024
-
[4]
Make it count: Text-to-image generation with an accurate number of objects
Lital Binyamin, Yoad Tewel, Hilit Segev, Eran Hirsch, Royi Rassin, and Gal Chechik. Make it count: Text-to-image generation with an accurate number of objects. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recogni- tion, pages 13242–13251, 2025. 3
2025
-
[5]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 1
work page internal anchor Pith review arXiv 2023
-
[6]
Align your latents: High-resolution video synthesis with la- tent diffusion models
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with la- tent diffusion models. InProc. of IEEE Intl. Conf. on Com- puter Vision and Pattern Recognition, pages 22563–22575,
-
[7]
Ditctrl: Exploring attention control in multi-modal dif- fusion transformer for tuning-free multi-prompt longer video generation
Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, and Xiangyu Yue. Ditctrl: Exploring attention control in multi-modal dif- fusion transformer for tuning-free multi-prompt longer video generation. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 7763–7772, 2025. 3
2025
-
[8]
Emerg- ing properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. In Proc. of IEEE Intl. Conf. on Computer Vision, pages 9650– 9660, 2021. 12
2021
-
[9]
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation.arXiv preprint arXiv:2310.19512, 2023. 2
work page internal anchor Pith review arXiv 2023
-
[10]
Videocrafter2: Overcoming data limitations for high-quality video diffusion models
Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 7310–7320, 2024. 1
2024
-
[11]
Gentron: Diffusion transformers for image and video generation
Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, and Juan-Manuel Perez-Rua. Gentron: Diffusion transformers for image and video generation. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 6441– 6451, 2024. 1
2024
-
[12]
Segment and Track Anything.arXiv preprint arXiv:2305.06558, 2023
Yangming Cheng, Liulei Li, Yuanyou Xu, Xiaodi Li, Zongxin Yang, Wenguan Wang, and Yi Yang. Segment and track anything.arXiv preprint arXiv:2305.06558, 2023. 3
-
[13]
Mean shift: A robust ap- proach toward feature space analysis.IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619,
Dorin Comaniciu and Peter Meer. Mean shift: A robust ap- proach toward feature space analysis.IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619,
-
[14]
Homogen: Enhanced video inpainting via homography propagation and diffusion
Ding Ding, Yueming Pan, Ruoyu Feng, Qi Dai, Kai Qiu, Jianmin Bao, Chong Luo, and Zhenzhong Chen. Homogen: Enhanced video inpainting via homography propagation and diffusion. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 22953–22962, 2025. 2
2025
-
[15]
A density-based algorithm for discovering clusters in large spatial databases with noise
Martin Ester, Hans-Peter Kriegel, J ¨org Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. InProc. ACM SIGKDD Int. Conf. Knowledge Discovery & Data Mining, pages 226– 231, 1996. 5
1996
-
[16]
Viewpoint: Panoramic video gen- eration with pretrained diffusion models
Zixun Fang, Kai Zhu, Zhiheng Liu, Yu Liu, Wei Zhai, Yang Cao, and Zheng-Jun Zha. Viewpoint: Panoramic video gen- eration with pretrained diffusion models. InProc. of Ad- vances in Neural Information Processing Systems, 2025. 1
2025
-
[17]
The devil is in the prompts: Retrieval-augmented prompt optimization for text-to-video generation
Bingjie Gao, Xinyu Gao, Xiaoxue Wu, Yujie Zhou, Yu Qiao, Li Niu, Xinyuan Chen, and Yaohui Wang. The devil is in the prompts: Retrieval-augmented prompt optimization for text-to-video generation. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 3173– 3183, 2025. 1
2025
-
[18]
Videoswap: Customized video subject swapping with interactive semantic point cor- respondence
Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, and Kevin Tang. Videoswap: Customized video subject swapping with interactive semantic point cor- respondence. InProc. of IEEE Intl. Conf. on Computer Vi- sion and Pattern Recognition, pages 7621–7630, 2024. 2
2024
-
[19]
Keyframe-guided creative video inpainting
Yuwei Guo, Ceyuan Yang, Anyi Rao, Chenlin Meng, Omer Bar-Tal, Shuangrui Ding, Maneesh Agrawala, Dahua Lin, and Bo Dai. Keyframe-guided creative video inpainting. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 13009–13020, 2025. 2
2025
-
[20]
LTX-Video: Realtime Video Latent Diffusion
Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103,
work page internal anchor Pith review arXiv
-
[21]
Clipscore: A reference-free evaluation met- ric for image captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation met- ric for image captioning. InProc. Conference on Empirical Methods in Natural Language Processing, pages 7514–7528,
-
[22]
Imagen Video: High Definition Video Generation with Diffusion Models
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion mod- els.arXiv preprint arXiv:2210.02303, 2022. 2
work page internal anchor Pith review arXiv 2022
-
[23]
Video dif- fusion models
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video dif- fusion models. InProc. of Advances in Neural Information Processing Systems, pages 8633–8646, 2022. 2 9
2022
-
[24]
Cogvideo: Large-scale pretraining for text-to-video generation via transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. InProc. of Intl. Conf. on Learn- ing Representations, 2023. 2
2023
-
[25]
Vbench: Comprehensive bench- mark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive bench- mark suite for video generative models. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 21807–21818, 2024. 6, 12
2024
-
[26]
Re- thinking fid: Towards a better evaluation metric for image generation
Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Re- thinking fid: Towards a better evaluation metric for image generation. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 9307–9315, 2024. 6
2024
-
[27]
Ground-a-video: Zero- shot grounded video editing using text-to-image diffusion models
Hyeonho Jeong and Jong Chul Ye. Ground-a-video: Zero- shot grounded video editing using text-to-image diffusion models. InProc. of Intl. Conf. on Learning Representations,
-
[28]
Free2guide: Training-free text-to-video alignment using im- age lvlm
Jaemin Kim, Bryan Sangwoo Kim, and Jong Chul Ye. Free2guide: Training-free text-to-video alignment using im- age lvlm. InProc. of IEEE Intl. Conf. on Computer Vision, pages 17920–17929, 2025. 1
2025
-
[29]
Segment any- thing
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProc. of IEEE Intl. Conf. on Computer Vision, pages 4015–4026, 2023. 3
2023
-
[30]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
Generative om- nimatte: Learning to decompose video into layers
Yao-Chih Lee, Erika Lu, Sarah Rumbley, Michal Geyer, Jia- Bin Huang, Tali Dekel, and Forrester Cole. Generative om- nimatte: Learning to decompose video into layers. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recog- nition, pages 12522–12532, 2025. 2
2025
-
[32]
Drivingdiffusion: Layout-guided multi-view driving scenarios video genera- tion with latent diffusion model
Xiaofan Li, Yifu Zhang, and Xiaoqing Ye. Drivingdiffusion: Layout-guided multi-view driving scenarios video genera- tion with latent diffusion model. InProc. of European Con- ference on Computer Vision, pages 469–485, 2024. 2
2024
-
[33]
Xiaofan Li, Yanpeng Sun, Chenming Wu, Fan Duan, YuAn Wang, Weihao Bo, Yumeng Zhang, and Dingkang Liang. Video4edit: Viewing image editing as a degenerate tempo- ral process.arXiv preprint arXiv:2511.18131, 2025. 2
-
[34]
Driverse: Navigation world model for driving simulation via multi- modal trajectory prompting and motion alignment
Xiaofan Li, Chenming Wu, Zhao Yang, Zhihao Xu, Yumeng Zhang, Dingkang Liang, Ji Wan, and Jun Wang. Driverse: Navigation world model for driving simulation via multi- modal trajectory prompting and motion alignment. InProc. of ACM Multimedia, pages 9753–9762, 2025. 2
2025
-
[35]
arXiv preprint arXiv:2501.10018 (2025) 4, 10, 3
Xiaowen Li, Haolan Xue, Peiran Ren, and Liefeng Bo. Dif- fueraser: A diffusion model for video inpainting.arXiv preprint arXiv:2501.10018, 2025. 2
-
[36]
Fvar: Visual autoregressive modeling via next focus prediction
Xiaofan Li, Chenming Wu, Yanpeng Sun, Jiaming Zhou, Delin Qu, Yansong Qu, Weihao Bo, Haibao Yu, and Dingkang Liang. Fvar: Visual autoregressive modeling via next focus prediction. InProc. of IEEE Intl. Conf. on Com- puter Vision and Pattern Recognition, 2026. 1
2026
-
[37]
Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chi- nese understanding.arXiv preprint arXiv:2405.08748, 2024. 1, 3
-
[38]
Transcrowd: weakly-supervised crowd counting with transformers.Science China Information Sciences, 65(6): 160104, 2022
Dingkang Liang, Xiwu Chen, Wei Xu, Yu Zhou, and Xiang Bai. Transcrowd: weakly-supervised crowd counting with transformers.Science China Information Sciences, 65(6): 160104, 2022. 3
2022
-
[39]
An end-to-end transformer model for crowd localization
Dingkang Liang, Wei Xu, and Xiang Bai. An end-to-end transformer model for crowd localization. InProc. of Euro- pean Conference on Computer Vision, pages 38–54, 2022
2022
-
[40]
Fo- cal inverse distance transform maps for crowd localization
Dingkang Liang, Wei Xu, Yingying Zhu, and Yu Zhou. Fo- cal inverse distance transform maps for crowd localization. IEEE Transactions on Multimedia, 25:6040–6052, 2022
2022
-
[41]
Crowdclip: Unsupervised crowd counting via vision-language model
Dingkang Liang, Jiahao Xie, Zhikang Zou, Xiaoqing Ye, Wei Xu, and Xiang Bai. Crowdclip: Unsupervised crowd counting via vision-language model. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 2893–2903, 2023
2023
-
[42]
Sood++: Leveraging unlabeled data to boost oriented object detection.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
Dingkang Liang, Wei Hua, Chunsheng Shi, Zhikang Zou, Xiaoqing Ye, and Xiang Bai. Sood++: Leveraging unlabeled data to boost oriented object detection.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 3
2025
-
[43]
Flow matching for generative mod- eling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling. InProc. of Intl. Conf. on Learning Representations,
-
[44]
Grounding dino: Marrying dino with grounded pre-training for open-set object detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InProc. of European Conference on Computer Vision, pages 38–55,
-
[45]
Video-p2p: Video editing with cross-attention control
Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 8599–8608, 2024. 2
2024
-
[46]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InProc. of Intl. Conf. on Learning Represen- tations, 2023. 3
2023
-
[47]
Evalcrafter: Benchmarking and eval- uating large video generation models
Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and eval- uating large video generation models. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 22139–22149, 2024. 6
2024
-
[48]
Latte: La- tent diffusion transformer for video generation.Transactions on Machine Learning Research, 2025
Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: La- tent diffusion transformer for video generation.Transactions on Machine Learning Research, 2025. 2
2025
-
[49]
Dreamix: Video diffusion models are general video editors
Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, and Yedid 10 Hoshen. Dreamix: Video diffusion models are general video editors.arXiv preprint arXiv:2302.01329, 2023. 2
-
[50]
Revideo: Remake a video with motion and content control
Chong Mou, Mingdeng Cao, Xintao Wang, Zhaoyang Zhang, Ying Shan, and Jian Zhang. Revideo: Remake a video with motion and content control. InProc. of Advances in Neural Information Processing Systems, pages 18481– 18505, 2024. 2
2024
-
[51]
Improved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. InProc. of Intl. Conf. on Machine Learning, pages 8162–8171, 2021. 3
2021
-
[52]
Introducing gpt-5.https://openai.com/ blog/introducing-gpt-5, 2025
OpenAI. Introducing gpt-5.https://openai.com/ blog/introducing-gpt-5, 2025. 6
2025
-
[53]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProc. of IEEE Intl. Conf. on Computer Vision, pages 4195–4205, 2023. 1, 2, 3
2023
-
[54]
Fatezero: Fus- ing attentions for zero-shot text-based video editing
Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fus- ing attentions for zero-shot text-based video editing. InProc. of IEEE Intl. Conf. on Computer Vision, pages 15932–15942,
-
[55]
Omnimattezero: Fast training-free omn- imatte with pre-trained video diffusion models
Dvir Samuel, Matan Levy, Nir Darshan, Gal Chechik, and Rami Ben-Ari. Omnimattezero: Fast training-free omn- imatte with pre-trained video diffusion models. InSIG- GRAPH Asia Conf., 2025. 2
2025
-
[56]
T2v-compbench: A comprehen- sive benchmark for compositional text-to-video generation
Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v-compbench: A comprehen- sive benchmark for compositional text-to-video generation. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 8406–8416, 2025. 6, 12
2025
-
[57]
Towards online real-time memory-based video inpainting transformers
Guillaume Thiry, Hao Tang, Radu Timofte, and Luc Van Gool. Towards online real-time memory-based video inpainting transformers. InProc. of IEEE Intl. Conf. on Com- puter Vision and Pattern Recognition, pages 6035–6044,
-
[58]
Fvd: A new metric for video generation
Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Rapha¨el Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. InProc. of Intl. Conf. on Learning Representations Workshop, 2019. 6
2019
-
[59]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video gen- erative models.arXiv preprint arXiv:2503.20314, 2025. 1, 2, 3, 6, 12, 13
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
Towards transformer-based aligned gen- eration with self-coherence guidance
Shulei Wang, Wang Lin, Hai Huang, Hanting Wang, Sihang Cai, WenKang Han, Tao Jin, Jingyuan Chen, Jiacheng Sun, Jieming Zhu, et al. Towards transformer-based aligned gen- eration with self-coherence guidance. InProc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 18455–18464, 2025. 2
2025
-
[61] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. In Proc. of Advances in Neural Information Processing Systems, pages 7594–7611, 2023. 2
[62] Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, and Kai Chen. Motionbooth: Motion-aware customized text-to-video generation. In Proc. of Advances in Neural Information Processing Systems, pages 34322–34348, 2024. 1
[63] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proc. of IEEE Intl. Conf. on Computer Vision, pages 7623–7633, 2023. 2
[64] Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, et al. Towards a better metric for text-to-video generation. arXiv preprint arXiv:2401.07781, 2024.
[65] Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. Draganything: Motion control for anything using entity representation. In Proc. of European Conference on Computer Vision, pages 331–348, 2024. 2
[66] Shuai Yang, Liming Jiang, Ziwei Liu, and Chen Change Loy. Vtoonify: Controllable high-resolution portrait video style transfer. ACM Transactions on Graphics, 41(6):1–15, 2022. 2
[67] Xi Yang, Chenhang He, Jianqi Ma, and Lei Zhang. Motion-guided latent diffusion for temporally consistent real-world video super-resolution. In Proc. of European Conference on Computer Vision, pages 224–242, 2024. 1
[68] Xiangpeng Yang, Linchao Zhu, Hehe Fan, and Yi Yang. Videograin: Modulating space-time attention for multi-grained video editing. In Proc. of Intl. Conf. on Learning Representations, 2025. 2
[69] Zhao Yang, Zezhong Qian, Xiaofan Li, Weixiang Xu, Gongpeng Zhao, Ruohong Yu, Lingsi Zhu, and Longjun Liu. Dualdiff+: Dual-branch diffusion for high-fidelity video generation with reward guidance. arXiv preprint arXiv:2503.03689, 2025. 2
[70] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In Proc. of Intl. Conf. on Learning Representations, 2025. 12, 13
[71] Zixuan Ye, Huijuan Huang, Xintao Wang, Pengfei Wan, Di Zhang, and Wenhan Luo. Stylemaster: Stylize your video with artistic generation and translation. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 2630–2640, 2025. 2
[72] Yuanyang Yin, Yaqi Zhao, Mingwu Zheng, Ke Lin, Jiarong Ou, Rui Chen, Victor Shea-Jay Huang, Jiahao Wang, Xin Tao, Pengfei Wan, et al. Towards precise scaling laws for video diffusion transformers. In Proc. of IEEE Intl. Conf. on Computer Vision and Pattern Recognition, pages 18155–18165, 2025. 2
[73] Xin Zhou, Dingkang Liang, Kaijin Chen, Tianrui Feng, Xiwu Chen, Hongkai Lin, Yikang Ding, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Less is enough: Training-free video diffusion acceleration via runtime-adaptive caching. arXiv preprint arXiv:2507.02860, 2025. 2, 8