pith. machine review for the scientific record.

arxiv: 2604.26503 · v1 · submitted 2026-04-29 · 💻 cs.CV

Recognition: unknown

Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

Pith reviewed 2026-05-07 11:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · classifier-free guidance · adaptive guidance · image generation · video generation · data manifold · detail-artifact dilemma

The pith

Point-wise adaptive guidance scales resolve the detail-artifact trade-off in diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard classifier-free guidance applies one fixed scale across every pixel or frame, so low values miss fine semantic details while high values create over-saturation, structural breaks, and video flicker. The paper traces the root cause to the fact that this uniform scale performs a straight-line extrapolation on a curved data manifold, pushing samples off the manifold in the orthogonal direction. SAMG measures a local conditional guidance energy at each point, then applies a low scale only where energy is high (near boundaries and textures) to stay safe and a high scale where energy is low to push semantics harder. The method is training-free and adds essentially zero cost. Experiments on image and video diffusion models show clearer semantics, preserved structure, and smoother motion.

Core claim

CFG performs a tangential linear extrapolation on the curved natural data manifold; the resulting orthogonal deviation is the source of the detail-artifact dilemma. SAMG derives a geometric upper bound on guidance and replaces the global scalar with point-wise conditional guidance energy, applying the conservative minimum scale at high-energy boundary regions to protect micro-textures and the aggressive maximum scale at low-energy regions to maximize semantic injection.
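
For reference, the extrapolation named in the core claim can be written out from the standard CFG update and Tweedie's formula. The notation is assumed here for illustration and is not taken from the paper: \epsilon_\theta is the noise prediction, w the guidance scale, \alpha_t and \sigma_t the noise schedule with x_t = \alpha_t x_0 + \sigma_t \epsilon. The circle of radius R is only a toy stand-in for a curved manifold.

```latex
\begin{align}
  % Standard CFG: one scalar w for every location
  \hat{\epsilon}_w &= \epsilon_\theta(x_t, \varnothing)
      + w \,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr) \\
  % Tweedie's formula, and the denoised estimate it gives for a noise prediction
  \mathbb{E}[x_0 \mid x_t] &= \frac{x_t + \sigma_t^2 \,\nabla_{x_t} \log p_t(x_t)}{\alpha_t},
  \qquad
  \hat{x}_0 = \frac{x_t - \sigma_t \,\hat{\epsilon}}{\alpha_t} \\
  % Substituting the CFG update: the denoised estimate is linear in w
  \hat{x}_0^{(w)} &= \hat{x}_0^{\varnothing}
      + w \,\bigl(\hat{x}_0^{c} - \hat{x}_0^{\varnothing}\bigr) \\
  % Toy example: a tangential step of length s from a point on a circle of radius R
  d(s) &= \sqrt{R^2 + s^2} - R = \frac{s^2}{2R} + O\!\left(\frac{s^4}{R^3}\right)
\end{align}
```

In the toy case the off-manifold distance grows quadratically with the step length and inversely with the radius. That is one way to read the claim that a large uniform scale is most harmful where the manifold is most curved.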

What carries the argument

Point-wise conditional guidance energy that selects between minimum and maximum guidance scales at each location so the sampling trajectory remains bounded on the data manifold.
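
A minimal sketch of what a point-wise scale selection could look like, under stated assumptions. The energy is taken as the L2 norm over channels of the conditional minus unconditional prediction (the referee report describes it as a score difference). The hard `threshold`, the function name `spatial_guidance`, and the (C, H, W) layout are hypothetical choices for illustration, not the paper's definitions.

```python
import numpy as np

def spatial_guidance(eps_uncond, eps_cond, w_min, w_max, threshold):
    """Blend two noise predictions with a per-location guidance scale.

    eps_uncond, eps_cond: arrays of shape (C, H, W).
    threshold: hypothetical cut-off on the energy map (an assumption).
    """
    delta = eps_cond - eps_uncond                        # guidance direction
    energy = np.sqrt((delta ** 2).sum(axis=0))           # (H, W) map, L2 over channels
    scale = np.where(energy > threshold, w_min, w_max)   # low scale where energy is high
    guided = eps_uncond + scale[None, :, :] * delta      # broadcast the map over channels
    return guided, energy, scale
```

Setting `w_min == w_max` reduces the function to standard CFG with a single global scale, which makes it easy to compare the two on the same inputs.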

Load-bearing premise

The point-wise conditional guidance energy correctly flags which regions need conservative versus aggressive scales, and the derived geometric upper bound actually keeps the trajectory on the manifold.

What would settle it

Generate samples from the same prompts with and without SAMG on a fixed diffusion model. The claim fails if SAMG does not improve, or worsens, both the semantic alignment scores and the structural or temporal artifact metrics.
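
The decision rule can be sketched as a paired comparison over prompts. The function name `settles_against` and the metric conventions (alignment: higher is better, artifact: lower is better) are assumptions for illustration. A real study would add a significance test rather than compare means alone.

```python
import numpy as np

def settles_against(align_base, align_samg, artifact_base, artifact_samg):
    """Return (flag, d_align, d_artifact) for paired per-prompt scores.

    flag is True when SAMG improves neither metric on average.
    align_*: higher is better. artifact_*: lower is better.
    """
    d_align = np.mean(np.asarray(align_samg) - np.asarray(align_base))
    d_artifact = np.mean(np.asarray(artifact_samg) - np.asarray(artifact_base))
    no_align_gain = d_align <= 0.0        # alignment did not go up
    no_artifact_gain = d_artifact >= 0.0  # artifacts did not go down
    return bool(no_align_gain and no_artifact_gain), float(d_align), float(d_artifact)
```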

Figures

Figures reproduced from arXiv: 2604.26503 by Bowen Tian, Haosen Li, Lei Wang, Shaofeng Liang, Soning Lai, Wenshuo Chen, Yutao Yue.

Figure 1: Visualizing the “detail-artifact dilemma” in Classifier-Free Guidance and our solution.
Figure 2: Qualitative comparison of Conditional generation, High CFG, and SAMG. Prompt: “close…
Figure 3: Visualization of the generated image and the evolution of spatial guidance energy maps
Figure 4: Qualitative comparison of video generation.
Figure 5: Visualization of decoupled latent energy evolution across the four SDXL channels.
Figure 6: A limitation of SAMG in handling dense semantic overlaps.
Original abstract

Diffusion models have achieved remarkable success in synthesizing complex static and temporal visuals, a breakthrough largely driven by Classifier-Free Guidance (CFG). However, despite its pivotal role in aligning generated content with textual prompts, standard CFG relies on a globally uniform scalar. This homogeneous amplification traps models in a well-documented "detail-artifact dilemma": low guidance scales fail to inject intricate semantics, while high scales inevitably cause structural degradation, color over-saturation, and temporal inconsistencies in videos. In this paper, we expose the physical root of this flaw through the lens of differential geometry. By analyzing Tweedie's Formula, we reveal that CFG intrinsically performs a tangential linear extrapolation. Because the natural data manifold is highly curved, this uniform linear step introduces a severe orthogonal deviation. To keep the generation trajectory safely bounded, we formulate a theoretical upper bound for spatial and adaptive guidance. Based on these geometric insights, we propose Spatial Adaptive Multi Guidance (SAMG), a training-free and virtually zero-cost sampling algorithm. SAMG dynamically computes point-wise conditional guidance energy, applying a conservative minimum scale to high-energy boundary regions to preserve delicate micro-textures, while deploying an aggressive maximum scale in low-energy regions to maximize semantic injection. Extensive experiments across diverse image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) architectures demonstrate that SAMG effectively resolves the detail-artifact dilemma, achieving superior semantic alignment, structural integrity, and temporal smoothness without any computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that Classifier-Free Guidance (CFG) in diffusion models performs tangential linear extrapolation on a curved data manifold (via Tweedie's formula), causing orthogonal deviations that manifest as the detail-artifact dilemma. It derives a theoretical upper bound on guidance scales from differential geometry to keep trajectories bounded, then introduces Spatial Adaptive Multi Guidance (SAMG): a training-free sampler that computes point-wise conditional guidance energy and applies conservative (min) scales to high-energy regions while using aggressive (max) scales in low-energy regions. Experiments on image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) models reportedly show improved semantic alignment, structure, and temporal consistency at no extra cost.

Significance. If the geometric upper bound is rigorously derived and the adaptive energy rule is shown to respect it, SAMG would offer a principled, zero-overhead improvement to a core sampling technique used across generative vision. The training-free nature and multi-architecture validation are strengths; a sound manifold-based justification could influence guidance design beyond ad-hoc scaling.

major comments (3)
  1. [§3] §3 (geometric analysis): The upper bound on orthogonal deviation is derived from the curvature of the data manifold and Tweedie's formula, but the subsequent definition of 'conditional guidance energy' (used to decide min vs. max scale per pixel) is not shown to be bounded by or equivalent to this quantity. The energy appears to be an empirical score difference without an explicit inequality linking it to the manifold deviation bound, so the claim that SAMG 'keeps trajectories safely bounded' rests on an unverified identification rather than a proven relation.
  2. [§4.1] §4.1 (SAMG algorithm): The adaptive rule applies min_scale to high-energy boundary regions and max_scale to low-energy regions, yet no ablation or sensitivity analysis is provided on how the energy threshold or the specific min/max values (free parameters) affect the bound; if the energy is not monotonically related to deviation magnitude, the rule could violate the theoretical upper bound in some regions.
  3. [Experiments] Experiments section: While qualitative and quantitative results are reported across SD 1.5, SDXL, SD3.5, CogVideoX and ModelScope, the paper does not include a direct comparison of the proposed energy against alternative proxies (e.g., gradient magnitude or reconstruction error) to confirm it is the correct indicator of required scale; without this, superiority over uniform CFG or other adaptive baselines may be overstated.
minor comments (2)
  1. [§4] Notation for the conditional guidance energy should be introduced with an explicit equation number and its normalization clarified (e.g., whether it is L2-normed or per-channel).
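  2. [Abstract] The abstract describes the method as 'virtually zero-cost' and later as having no computational overhead. Standard CFG already evaluates both the conditional and the unconditional score at each step, so the authors should state the exact overhead relative to standard CFG, for example the cost of computing the per-point energy map and selecting the scale.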
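
To make the normalization question in minor comment 1 concrete, here are two possible readings of the energy, sketched side by side. Both function names and the min-max scaling are hypothetical. The page does not say which, if either, the paper uses.

```python
import numpy as np

def energy_l2(delta):
    """Single map of shape (H, W): L2 norm across the channel axis."""
    return np.linalg.norm(delta, axis=0)

def energy_per_channel(delta):
    """C maps of shape (C, H, W): magnitude, min-max scaled per channel to [0, 1]."""
    mag = np.abs(delta)
    lo = mag.min(axis=(1, 2), keepdims=True)
    hi = mag.max(axis=(1, 2), keepdims=True)
    return (mag - lo) / np.maximum(hi - lo, 1e-12)  # guard against flat channels
```

The first reading yields one scale map shared by all channels. The second allows a different map per latent channel, which changes both the threshold semantics and the cost.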

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify important areas where the connection between the geometric analysis and the SAMG algorithm can be made more rigorous, and where additional experimental validation would strengthen the claims. We address each major comment point by point below and will revise the manuscript to incorporate the suggested clarifications and analyses.

Point-by-point responses
  1. Referee: [§3] §3 (geometric analysis): The upper bound on orthogonal deviation is derived from the curvature of the data manifold and Tweedie's formula, but the subsequent definition of 'conditional guidance energy' (used to decide min vs. max scale per pixel) is not shown to be bounded by or equivalent to this quantity. The energy appears to be an empirical score difference without an explicit inequality linking it to the manifold deviation bound, so the claim that SAMG 'keeps trajectories safely bounded' rests on an unverified identification rather than a proven relation.

    Authors: We appreciate the referee's precise identification of this gap. Section 3 derives the upper bound on orthogonal deviation from the curvature term in the local approximation of Tweedie's formula and shows that uniform CFG produces deviations orthogonal to the manifold. The conditional guidance energy is introduced as the point-wise score difference, which is the quantity that drives the orthogonal component under the same local linearization. While the manuscript presents this as a direct link, we acknowledge that an explicit inequality establishing that the energy is bounded by (or proportional to) the curvature-derived deviation term was not formally stated. In the revision we will add a short derivation in §3 showing that, under the first-order approximation used for the bound, the energy is monotonically related to the deviation magnitude, thereby justifying that conservative scaling in high-energy regions keeps trajectories within the derived bound. revision: yes

  2. Referee: [§4.1] §4.1 (SAMG algorithm): The adaptive rule applies min_scale to high-energy boundary regions and max_scale to low-energy regions, yet no ablation or sensitivity analysis is provided on how the energy threshold or the specific min/max values (free parameters) affect the bound; if the energy is not monotonically related to deviation magnitude, the rule could violate the theoretical upper bound in some regions.

    Authors: We agree that the absence of sensitivity analysis leaves the robustness of the min/max rule open to question. The design choice (min scale on high-energy regions, max on low-energy) follows from the geometric observation that high-energy locations are where the orthogonal deviation is largest. To address the concern directly, the revised manuscript will include an ablation subsection in §4.1 that (i) varies the energy threshold and the concrete min/max values over a range, (ii) reports the resulting deviation metrics on both synthetic curved manifolds and real diffusion trajectories, and (iii) verifies that the chosen rule never exceeds the theoretical upper bound derived in §3. We will also add a brief monotonicity check (empirical correlation between energy and measured deviation) to confirm the rule does not inadvertently violate the bound. revision: yes

  3. Referee: [Experiments] Experiments section: While qualitative and quantitative results are reported across SD 1.5, SDXL, SD3.5, CogVideoX and ModelScope, the paper does not include a direct comparison of the proposed energy against alternative proxies (e.g., gradient magnitude or reconstruction error) to confirm it is the correct indicator of required scale; without this, superiority over uniform CFG or other adaptive baselines may be overstated.

    Authors: We accept that a direct comparison of the conditional guidance energy against other candidate proxies would provide stronger evidence that it is the most appropriate indicator. The current experiments demonstrate consistent gains over uniform CFG and several published adaptive baselines, but they do not benchmark the energy metric itself against alternatives such as gradient magnitude or reconstruction error. In the revised version we will add a targeted comparison in the Experiments section: for each model we will compute three proxies (our energy, gradient magnitude, and reconstruction error), apply the same min/max adaptive rule with each proxy, and report both quantitative metrics (FID, CLIP score, temporal consistency) and qualitative examples. This will allow readers to see which proxy best correlates with reduced artifacts while preserving semantic alignment. revision: yes
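
The monotonicity check promised in response 2 can be illustrated on a toy manifold. This is a sketch under stated assumptions, not the authors' experiment: the manifold is a circle, `step` is a stand-in for energy times scale, and the rank correlation is computed by hand for tie-free data.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation for 1-D arrays without ties."""
    ra = np.argsort(np.argsort(a))   # ranks of a
    rb = np.argsort(np.argsort(b))   # ranks of b
    return float(np.corrcoef(ra, rb)[0, 1])

def tangent_step_deviation(step, radius):
    """Distance from a circle after a tangential step from a point on it."""
    return np.sqrt(radius ** 2 + step ** 2) - radius

rng = np.random.default_rng(0)
step = rng.uniform(0.0, 2.0, size=1000)       # toy stand-in for energy times scale
fixed = tangent_step_deviation(step, 1.0)     # same curvature everywhere
radius = rng.uniform(0.5, 5.0, size=1000)     # curvature varies by location
varied = tangent_step_deviation(step, radius)

rho_fixed = spearman(step, fixed)     # monotone map, so equal to 1 up to rounding
rho_varied = spearman(step, varied)   # lower: curvature also drives the deviation
print(rho_fixed, rho_varied)
```

With constant curvature the proxy ranks locations perfectly. Once curvature varies, step size alone no longer determines the deviation, which is the situation the referee's major comment 2 warns about.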

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper's chain proceeds from a standard reference (Tweedie's formula) to an interpretive geometric analysis of CFG as tangential extrapolation, followed by formulation of a theoretical upper bound on orthogonal deviation and a heuristic adaptive rule (point-wise conditional guidance energy) that applies min/max scales spatially. No equation is shown to equal a fitted input by construction, no load-bearing premise reduces to a self-citation whose validity depends on the present work, and the proposed SAMG algorithm is presented as a training-free heuristic motivated by the geometric insight rather than a statistical prediction forced by data reuse. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on a geometric reinterpretation of CFG and an unverified upper bound; no independent evidence or shipped code is supplied in the abstract.

free parameters (2)
  • min_scale
    Conservative guidance scale applied to high-energy boundary regions; value not specified in abstract.
  • max_scale
    Aggressive guidance scale applied to low-energy regions; value not specified in abstract.
axioms (1)
  • Domain assumption: CFG performs tangential linear extrapolation on a highly curved natural data manifold
    Invoked via analysis of Tweedie's Formula in the abstract.

pith-pipeline@v0.9.0 · 5583 in / 1278 out tokens · 66517 ms · 2026-05-07T11:02:21.220036+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

    cs.RO 2026-05 unverdicted novelty 5.0

    DAJI learns future-aware joint intents from language to enable proactive humanoid control, reporting 94.42% rollout success on HumanML3D-style tasks and 0.152 subsequence FID on BABEL.
