pith. machine review for the scientific record.

arxiv: 2604.08546 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: 2 theorem links


When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords: text-to-video diffusion · numerical alignment · object counting · attention heads · latent layout · training-free guidance · CountBench

The pith

A training-free method reads a rough object layout out of a diffusion model's attention heads and uses it to correct object counts in text-to-video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-video diffusion models frequently generate a different number of objects than a text prompt specifies. NUMINA addresses this by scanning the model's self- and cross-attention layers for heads that reveal a rough spatial layout of the objects the prompt implies. It refines that layout conservatively and then modulates the cross-attention maps during generation to steer the model toward the correct count. The approach raises counting accuracy on the newly introduced CountBench by as much as 7.4 percent on a 1.3-billion-parameter model and by roughly 5 percent on larger models, while also improving text-video alignment scores and preserving frame-to-frame consistency. A reader should care because reliable quantity control is a basic requirement for any practical video generation tool.
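
To make the identify half concrete: below is a minimal, hypothetical sketch of fusing a few attention maps for an object token and counting connected blobs in the result. The array shapes, the peakedness score, and the threshold are editorial assumptions, not the paper's procedure.

```python
import numpy as np
from scipy.ndimage import label  # connected-component labeling

def fuse_discriminative_heads(cross_attn, object_token, top_k=4):
    """Keep the heads whose map for the object token is most peaked
    (a stand-in for the paper's head-selection criterion) and average them."""
    maps = cross_attn[..., object_token]                       # (heads, H, W)
    peakedness = maps.max(axis=(1, 2)) / (maps.mean(axis=(1, 2)) + 1e-8)
    keep = np.argsort(peakedness)[-top_k:]
    return maps[keep].mean(axis=0)                             # fused (H, W) layout

def count_from_layout(layout, rel_thresh=0.5):
    """Threshold the fused layout and count connected blobs as instances."""
    mask = layout > rel_thresh * layout.max()
    _, n_blobs = label(mask)
    return n_blobs

# Toy demo on random tensors standing in for real attention maps.
rng = np.random.default_rng(0)
cross_attn = rng.random((16, 32, 32, 77))   # 16 heads, 32x32 latent, 77 tokens
layout = fuse_discriminative_heads(cross_attn, object_token=5)
print("estimated object count:", count_from_layout(layout))
```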

Core claim

NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration, yielding higher numerical fidelity in the final video without any model retraining.

What carries the argument

The countable latent layout extracted from selected discriminative attention heads, which supplies the structural signal used to modulate cross-attention and enforce correct object quantities.
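
One plausible way to turn a fused attention map into a countable layout is to cluster high-attention coordinates into per-instance masks. The sketch below uses scikit-learn's DBSCAN; the eps, min_samples, and threshold values are illustrative guesses rather than the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def layout_to_instances(layout, rel_thresh=0.6, eps=2.0, min_samples=4):
    """Cluster high-attention pixel coordinates into instances and return
    one binary mask per cluster. All parameters are illustrative."""
    ys, xs = np.nonzero(layout > rel_thresh * layout.max())
    coords = np.stack([ys, xs], axis=1)
    if len(coords) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(coords).labels_
    masks = []
    for k in set(labels) - {-1}:                 # -1 marks DBSCAN noise
        m = np.zeros_like(layout, dtype=bool)
        pts = coords[labels == k]
        m[pts[:, 0], pts[:, 1]] = True
        masks.append(m)
    return masks

# A synthetic layout with two Gaussian bumps should yield two instances.
yy, xx = np.mgrid[0:32, 0:32]
bump = lambda cy, cx: np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / 8.0)
print("instances found:", len(layout_to_instances(bump(8, 8) + bump(22, 22))))
```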

If this is right

  • Counting accuracy rises by up to 7.4 percent on the 1.3B model, 4.9 percent on the 5B model, and 5.5 percent on the 14B model.
  • CLIP text-video alignment improves while temporal consistency across frames is maintained.
  • The structural guidance complements existing techniques such as prompt rewriting and seed sampling.
  • The same identify-then-guide pattern works across model scales without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Internal attention maps in diffusion models appear to encode usable layout information that could be applied to other alignment problems such as spatial relations or action ordering.
  • Because the method needs no training, it can be stacked with other lightweight post-hoc corrections to handle more complex prompts.
  • Further tests on longer videos or prompts with multiple overlapping counts would reveal how far the latent-layout signal generalizes.

Load-bearing premise

Selecting particular attention heads will reliably produce a usable countable layout from the prompt without adding new errors or requiring per-model tuning.
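
This premise is testable on its own: score each head on a small calibration set of prompts with known counts, and see whether any head's blob count tracks ground truth. A hypothetical sketch on synthetic maps, where one head encodes the count and another is noise:

```python
import numpy as np
from scipy.ndimage import label

def blob_count(attn_map, rel_thresh=0.5):
    _, n = label(attn_map > rel_thresh * attn_map.max())
    return n

def head_accuracy(per_head_maps, true_counts):
    """per_head_maps: list over prompts of (heads, H, W) arrays.
    Returns per-head fraction of prompts where its blob count is correct."""
    n_heads = per_head_maps[0].shape[0]
    hits = np.zeros(n_heads)
    for maps, k in zip(per_head_maps, true_counts):
        for h in range(n_heads):
            hits[h] += blob_count(maps[h]) == k
    return hits / len(true_counts)

# Synthetic calibration set: head 0 "encodes" the right count, head 1 is noise.
rng = np.random.default_rng(1)
yy, xx = np.mgrid[0:32, 0:32]
def k_bumps(k):
    centers = rng.choice(28, size=(k, 2)) + 2
    return sum(np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / 4.0) for cy, cx in centers)
prompts = [np.stack([k_bumps(k), rng.random((32, 32))]) for k in (2, 3, 4)]
print("per-head accuracy:", head_accuracy(prompts, [2, 3, 4]))
```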

What would settle it

Applying NUMINA to CountBench and finding no gain, or a drop, in object-counting accuracy relative to the unmodified baseline models would show the central claim is incorrect.
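
That test is mechanical enough to sketch. In the toy protocol below, both the generator and the per-frame detector are stubs, since this review does not specify the paper's actual evaluation pipeline; the accuracy computation shows concretely what a gain or drop relative to baseline would mean.

```python
import numpy as np

def counting_accuracy(generate, detect_count, benchmark, seeds=(0, 1, 2)):
    """Fraction of (prompt, target_count) pairs for which the median
    per-frame detected count equals the target, averaged over seeds."""
    hits, total = 0, 0
    for prompt, target in benchmark:
        for seed in seeds:
            frames = generate(prompt, seed)                 # stubbed T2V model
            per_frame = [detect_count(f, prompt) for f in frames]
            hits += int(np.median(per_frame) == target)
            total += 1
    return hits / total

# Stubs standing in for a real model and a real open-set object detector.
rng = np.random.default_rng(2)
fake_generate = lambda prompt, seed: [None] * 16            # 16 "frames"
fake_detect = lambda frame, prompt: rng.integers(1, 5)
bench = [("two cats on a sofa", 2), ("three red balloons", 3)]
print("counting accuracy:", counting_accuracy(fake_generate, fake_detect, bench))
```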

Figures

Figures reproduced from arXiv: 2604.08546 by Dingkang Liang, Xiang Bai, Xiaofan Li, Xin Zhou, Xiwu Chen, Yu Chen, Zhengyang Sun.

Figure 1. We present NUMINA, a training-free framework that alleviates the misalignment between precise numerals and visual instances in text-to-video diffusion models. We significantly improve counting accuracy while maintaining natural layouts and temporal coherence. view at source ↗
Figure 2. Visualization of the cross-attention maps corresponding… view at source ↗
Figure 3. The pipeline of our NUMINA follows a two-phase paradigm. Given a text prompt containing numerals, we first perform the… view at source ↗
Figure 4. The PCA visualization of self-attention maps for… view at source ↗
Figure 5. Qualitative comparison of NUMINA with the most advanced commercial models. view at source ↗
Figure 7. Ablation on the reference timesteps t⋆ for head selection. view at source ↗
Figure 8. PCA visualization across timesteps and layers. view at source ↗
Figure 9. A failure case of NUMINA. The parrots' heads become… view at source ↗
Figure 10. More representative examples where our method faithfully generates the specified number of objects. view at source ↗
read the original abstract

Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces NUMINA, a training-free identify-then-guide framework for text-to-video diffusion models. It selects discriminative self- and cross-attention heads to derive a countable latent layout from the prompt, conservatively refines this layout, and modulates cross-attention during regeneration to better align generated object counts with textual numerals. On the introduced CountBench, it reports counting accuracy gains of up to 7.4% on Wan2.1-1.3B and 4.9%/5.5% on 5B/14B models, plus improved CLIP alignment with preserved temporal consistency, positioning the method as complementary to seed search and prompt engineering.

Significance. If validated, the work provides a practical, training-free structural guidance technique that leverages internal attention maps for count correction in T2V models, a persistent weakness in current systems. The new CountBench benchmark and public code release are positive contributions for reproducibility and future evaluation. The modest reported gains indicate incremental rather than transformative impact, but the approach could integrate usefully with other inference-time methods if the head-selection step proves robust.

major comments (3)
  1. [Experiments] Experiments section (results on CountBench): The reported accuracy improvements (7.4% on 1.3B, 4.9% and 5.5% on larger models) are presented without specifying CountBench size, exact counting metric and protocol, baselines (including whether they include prior attention-guidance or prompt-engineering methods), number of seeds per prompt, or any error bars/statistical tests. This information is load-bearing for the central empirical claim.
  2. [Method] Method section on discriminative head selection: The pipeline's first step assumes that a small subset of self- and cross-attention heads can be automatically identified to yield a reliable 'countable latent layout' that can be thresholded or clustered without high false-positive/negative rates. No ablation is described testing sensitivity to object category, spatial arrangement, prompt complexity, or model scale; if this step errs on even 10-15% of cases, the modest net gains would be erased or reversed.
  3. [Method] Method section on conservative refinement and cross-attention modulation: The precise algorithms, thresholds, and parameters for layout refinement and attention modulation are not fully specified (e.g., how 'conservative' is operationalized, how temporal consistency is enforced across frames). This makes it impossible to reproduce the guidance step or diagnose why CLIP improves while temporal metrics remain stable.
minor comments (3)
  1. [Experiments] The paper should include qualitative failure cases (prompts where NUMINA leaves the count unchanged or worsens it) to illustrate the limits of the head-selection heuristic.
  2. [Method] Notation for attention heads, latent layouts, and modulation operations should be introduced with explicit equations or pseudocode early in the method section for clarity (see the illustrative sketch after this list).
  3. [Introduction] Related-work discussion could more explicitly contrast NUMINA with prior attention-map guidance techniques in diffusion models (e.g., those using cross-attention for layout control).
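
Minor comment 2 asks for explicit equations or pseudocode. Purely as an editorial illustration, one common pattern in the attention-guidance literature is to add a signed bias to the counted token's attention logits before the softmax; the bias form and the strength lam below are assumptions, not the authors' algorithm.

```python
import numpy as np

def modulate_cross_attention(logits, layout_mask, token_idx, lam=2.0):
    """Bias one text token's attention logits toward the layout regions,
    then renormalize with a numerically stable softmax.

    logits: (N, T) pre-softmax scores for N spatial positions over T tokens.
    layout_mask: (N,) boolean, True where the refined layout places objects.
    """
    biased = logits.copy()
    sign = np.where(layout_mask, 1.0, -1.0)   # boost inside, suppress outside
    biased[:, token_idx] += lam * sign
    biased -= biased.max(axis=-1, keepdims=True)
    probs = np.exp(biased)
    return probs / probs.sum(axis=-1, keepdims=True)

# Tiny demo: 4 positions, 3 tokens; positions 0 and 1 lie inside the layout.
logits = np.zeros((4, 3))
mask = np.array([True, True, False, False])
print(modulate_cross_attention(logits, mask, token_idx=1).round(3))
```

Biasing logits rather than post-softmax probabilities keeps each spatial position's attention a proper distribution over tokens, which is one reason this form recurs in attention-guidance work.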

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive and detailed review of our manuscript. We address each major comment point by point below. We will revise the manuscript to incorporate the requested clarifications, details, and additional analyses where they strengthen the presentation of our work.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (results on CountBench): The reported accuracy improvements (7.4% on 1.3B, 4.9% and 5.5% on larger models) are presented without specifying CountBench size, exact counting metric and protocol, baselines (including whether they include prior attention-guidance or prompt-engineering methods), number of seeds per prompt, or any error bars/statistical tests. This information is load-bearing for the central empirical claim.

    Authors: We agree that these experimental details are necessary to fully substantiate the central claims. In the revised manuscript we will expand the Experiments section to report the exact size of CountBench, the precise counting metric and evaluation protocol (including how objects are detected and counted in generated videos), a complete list of baselines that explicitly includes prior attention-guidance and prompt-engineering methods, the number of seeds evaluated per prompt, and error bars with appropriate statistical tests. These additions will allow readers to assess the reported gains rigorously. revision: yes

  2. Referee: [Method] Method section on discriminative head selection: The pipeline's first step assumes that a small subset of self- and cross-attention heads can be automatically identified to yield a reliable 'countable latent layout' that can be thresholded or clustered without high false-positive/negative rates. No ablation is described testing sensitivity to object category, spatial arrangement, prompt complexity, or model scale; if this step errs on even 10-15% of cases, the modest net gains would be erased or reversed.

    Authors: The discriminative head selection procedure is described in Section 3.2, where heads are chosen according to their attention focus on numerically relevant tokens. We acknowledge that dedicated ablations on robustness are absent from the current version. In the revision we will add an ablation study that systematically varies object category, spatial arrangement, prompt complexity, and model scale, reporting the impact on layout quality and final counting accuracy. This will quantify the reliability of the step and show how the subsequent conservative refinement limits error propagation. revision: yes

  3. Referee: [Method] Method section on conservative refinement and cross-attention modulation: The precise algorithms, thresholds, and parameters for layout refinement and attention modulation are not fully specified (e.g., how 'conservative' is operationalized, how temporal consistency is enforced across frames). This makes it impossible to reproduce the guidance step or diagnose why CLIP improves while temporal metrics remain stable.

    Authors: We will revise the Method section to supply the missing algorithmic details, including the exact thresholds, parameter values, and pseudocode for conservative layout refinement and cross-attention modulation. We will explicitly define the operationalization of 'conservative' refinement and describe the frame-to-frame consistency enforcement mechanism. These clarifications will make the guidance procedure fully reproducible from the text and will help explain the observed CLIP gains alongside stable temporal metrics. The released code already implements these steps; the paper revision will align the description with the implementation. revision: yes
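
How 'conservative' is operationalized remains the open question in response 3. One illustrative reading, sketched below and not taken from the paper, is a minimal-edit policy: keep every existing instance in place, drop only the weakest blobs when there are too many, and add new blobs only in empty space when there are too few.

```python
import numpy as np

def refine_conservatively(masks, target, shape=(32, 32), radius=2):
    """Adjust a list of boolean instance masks to `target` instances while
    changing as little as possible (illustrative policy, not the paper's)."""
    masks = sorted(masks, key=lambda m: m.sum(), reverse=True)
    if len(masks) > target:                    # too many: drop weakest blobs
        return masks[:target]
    occupied = np.zeros(shape, dtype=bool)
    for m in masks:
        occupied |= m
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    while len(masks) < target:                 # too few: add in empty space
        ys, xs = np.nonzero(occupied)
        if len(ys) == 0:
            cy, cx = shape[0] // 2, shape[1] // 2
        else:
            # pick the free cell farthest from all occupied cells
            d2 = ((yy[..., None] - ys) ** 2 + (xx[..., None] - xs) ** 2).min(-1)
            cy, cx = np.unravel_index(np.argmax(d2), shape)
        new = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
        masks.append(new)
        occupied |= new
    return masks

# Demo: one existing instance, target of three; two blobs get added elsewhere.
start = [np.zeros((32, 32), dtype=bool)]
start[0][5:8, 5:8] = True
print("instances after refinement:", len(refine_conservatively(start, target=3)))
```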

Circularity Check

0 steps flagged

No significant circularity; empirical training-free method with independent content

full rationale

The paper presents NUMINA as a training-free identify-then-guide framework that selects discriminative self- and cross-attention heads to derive a countable latent layout, refines it conservatively, and modulates cross-attention for regeneration. No equations, derivations, or fitted parameters are described that reduce the reported accuracy gains (e.g., up to 7.4% on CountBench) to the inputs by construction. The approach is benchmarked empirically on introduced data and multiple model scales while maintaining other metrics, with no load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work. The central claims rest on observable improvements from the pipeline rather than self-referential definitions or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that attention patterns in diffusion models encode countable object information that can be extracted and used for guidance without retraining.

axioms (1)
  • domain assumption Discriminative self- and cross-attention heads encode a countable latent layout consistent with the text prompt
    The identify step of NUMINA relies on this to derive the layout before guidance.
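
The axiom is at least cheap to inspect: Figures 4 and 8 visualize self-attention with PCA, and the same probe can be run on any head. A minimal sketch assuming a square latent grid; the normalization and the choice of three components are arbitrary display decisions.

```python
import numpy as np

def pca_attention_image(self_attn, n_components=3):
    """Project each query position's attention row onto the top principal
    components; if the axiom holds, instances show up as distinct colors.

    self_attn: (H*W, H*W) row-stochastic self-attention for one head.
    Returns an (H, W, n_components) array scaled to [0, 1] for display.
    """
    side = int(round(np.sqrt(self_attn.shape[0])))
    X = self_attn - self_attn.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ vt[:n_components].T                      # (H*W, n_components)
    span = np.ptp(proj, axis=0)
    proj = (proj - proj.min(axis=0)) / (span + 1e-8)
    return proj.reshape(side, side, n_components)

# Smoke test on a random row-stochastic map.
rng = np.random.default_rng(3)
attn = rng.random((16 * 16, 16 * 16))
attn /= attn.sum(axis=1, keepdims=True)
print(pca_attention_image(attn).shape)   # (16, 16, 3)
```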

pith-pipeline@v0.9.0 · 5489 in / 1249 out tokens · 27132 ms · 2026-05-10T18:36:06.356197+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
