arxiv: 2604.26503 · v1 · submitted 2026-04-29 · 💻 cs.CV

Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

Haosen Li, Wenshuo Chen, Lei Wang, Shaofeng Liang, Bowen Tian, Soning Lai, Yutao Yue

Pith reviewed 2026-05-07 11:02 UTC · model grok-4.3

keywords: diffusion models, classifier-free guidance, adaptive guidance, image generation, video generation, data manifold, detail-artifact dilemma

The pith

Point-wise adaptive guidance scales resolve the detail-artifact trade-off in diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard classifier-free guidance applies one fixed scale across every pixel or frame, so low values miss fine semantic details while high values create over-saturation, structural breaks, and video flicker. The paper traces the root cause to the fact that this uniform scale performs a straight-line extrapolation on a curved data manifold, pushing samples off the manifold in the orthogonal direction. SAMG measures a local conditional guidance energy at each point, then applies a low scale only where energy is high (near boundaries and textures) to stay safe and a high scale where energy is low to push semantics harder. The method is training-free and adds essentially zero cost. Experiments on image and video diffusion models show clearer semantics, preserved structure, and smoother motion.
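The selection rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the energy definition (squared gap between conditional and unconditional predictions, summed over channels), the quantile threshold, and the names `spatial_guidance`, `min_scale`, `max_scale`, and `quantile` are all assumptions made for the example.

```python
import numpy as np

def spatial_guidance(eps_uncond, eps_cond, min_scale, max_scale, quantile=0.8):
    """Blend two noise predictions with a per-location guidance scale.

    eps_uncond, eps_cond: arrays of shape (C, H, W).
    quantile: hypothetical threshold; locations whose energy lies above this
    quantile are treated as "high energy".
    """
    delta = eps_cond - eps_uncond                 # guidance direction, (C, H, W)
    energy = np.sum(delta ** 2, axis=0)           # assumed energy map, (H, W)
    threshold = np.quantile(energy, quantile)
    # Conservative scale where energy is high, aggressive scale elsewhere.
    scale = np.where(energy > threshold, min_scale, max_scale)
    guided = eps_uncond + scale[None, :, :] * delta  # CFG form with a per-location scale
    return guided, energy, scale

# Toy inputs: random predictions with one patch where the two predictions disagree strongly.
rng = np.random.default_rng(0)
eps_u = rng.normal(size=(4, 8, 8))
eps_c = eps_u + 0.1 * rng.normal(size=(4, 8, 8))
eps_c[:, 2:4, 2:4] += 2.0
guided, energy, scale = spatial_guidance(eps_u, eps_c, min_scale=3.0, max_scale=9.0)
```

In this toy run the high-disagreement patch receives `min_scale` and most other locations receive `max_scale`. The only work beyond uniform CFG is an element-wise reduction and a comparison on predictions the sampler already has.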

Core claim

CFG performs a tangential linear extrapolation on the curved natural data manifold; the resulting orthogonal deviation is the source of the detail-artifact dilemma. SAMG derives a geometric upper bound on guidance and replaces the global scalar with point-wise conditional guidance energy, applying the conservative minimum scale at high-energy boundary regions to protect micro-textures and the aggressive maximum scale at low-energy regions to maximize semantic injection.
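For readers who want the extrapolation claim spelled out, the standard identities behind it can be written as follows. This is a sketch in common notation, assuming a forward process $x_t = \alpha_t x_0 + \sigma_t \epsilon$ and an $\epsilon$-prediction network; the symbols $w$, $\epsilon_u$, $\epsilon_c$ and the per-location scale $w_i$ are shorthand chosen here, and the paper's own notation and bound may differ.

```latex
% Tweedie's formula for x_t = \alpha_t x_0 + \sigma_t \epsilon
\hat{x}_0(x_t) = \mathbb{E}[x_0 \mid x_t]
  = \frac{x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t)}{\alpha_t}
  \approx \frac{x_t - \sigma_t\, \epsilon_\theta(x_t)}{\alpha_t}

% Classifier-free guidance with a single global scale w
\tilde{\epsilon} = \epsilon_u + w\,(\epsilon_c - \epsilon_u)

% Substituting into the denoised estimate gives a point on the
% straight line through \hat{x}_0^{u} and \hat{x}_0^{c}
\hat{x}_0^{w} = \hat{x}_0^{u} + w\,(\hat{x}_0^{c} - \hat{x}_0^{u})

% Spatially varying version (assumed form): one scale per location i
\tilde{\epsilon}_i = \epsilon_{u,i} + w_i\,(\epsilon_{c,i} - \epsilon_{u,i}),
  \qquad w_i \in \{w_{\min}, w_{\max}\}
```

For $w > 1$ the third line moves past $\hat{x}_0^{c}$ along a straight line, which is the linear extrapolation the core claim refers to.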

What carries the argument

Point-wise conditional guidance energy that selects between minimum and maximum guidance scales at each location so the sampling trajectory remains bounded on the data manifold.

Load-bearing premise

The point-wise conditional guidance energy correctly flags which regions need conservative versus aggressive scales, and the derived geometric upper bound actually keeps the trajectory on the manifold.

What would settle it

Generate the same prompts with and without SAMG on a fixed diffusion model, then measure whether SAMG fails to improve (or worsens) both semantic alignment scores and structural/temporal artifact metrics.
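A paired comparison of this kind could be tallied as in the sketch below. It assumes per-prompt scores have already been computed for both runs with matched prompts and seeds; the function name `paired_verdict`, the metric conventions (alignment higher is better, artifacts lower is better), and the numbers in the usage lines are placeholders for illustration, not results from the paper.

```python
import numpy as np

def paired_verdict(align_base, align_samg, artifact_base, artifact_samg):
    """Summarize paired per-prompt scores from a baseline run and a SAMG run.

    align_*: semantic-alignment scores, higher is better.
    artifact_*: artifact scores, lower is better (hypothetical metric).
    """
    d_align = np.asarray(align_samg, float) - np.asarray(align_base, float)
    d_art = np.asarray(artifact_base, float) - np.asarray(artifact_samg, float)
    return {
        "mean_alignment_gain": float(d_align.mean()),
        "mean_artifact_reduction": float(d_art.mean()),
        "alignment_win_rate": float((d_align > 0).mean()),
        "artifact_win_rate": float((d_art > 0).mean()),
        # The claim is contradicted only if neither axis improves on average.
        "claim_contradicted": bool(d_align.mean() <= 0 and d_art.mean() <= 0),
    }

# Placeholder numbers, purely to show the call shape.
verdict = paired_verdict(
    align_base=[0.30, 0.28, 0.31], align_samg=[0.32, 0.29, 0.31],
    artifact_base=[0.5, 0.4, 0.6], artifact_samg=[0.4, 0.4, 0.5],
)
```

A real test would add a significance check over many prompts; the mean and win rate here are only the simplest summary.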

Figures

Figures reproduced from arXiv: 2604.26503 by Bowen Tian, Haosen Li, Lei Wang, Shaofeng Liang, Soning Lai, Wenshuo Chen, Yutao Yue.

**Figure 1.** Visualizing the “detail-artifact dilemma” in Classifier-Free Guidance and our solution.

**Figure 2.** Qualitative comparison of Conditional generation, High CFG, and SAMG. Prompt: “close

**Figure 3.** Visualization of the generated image and the evolution of spatial guidance energy maps

**Figure 4.** Qualitative comparison of video generation.

**Figure 5.** Visualization of decoupled latent energy evolution across the four SDXL channels.

**Figure 6.** A limitation of SAMG in handling dense semantic overlaps.

Original abstract

Diffusion models have achieved remarkable success in synthesizing complex static and temporal visuals, a breakthrough largely driven by Classifier-Free Guidance (CFG). However, despite its pivotal role in aligning generated content with textual prompts, standard CFG relies on a globally uniform scalar. This homogeneous amplification traps models in a well-documented "detail-artifact dilemma": low guidance scales fail to inject intricate semantics, while high scales inevitably cause structural degradation, color over-saturation, and temporal inconsistencies in videos. In this paper, we expose the physical root of this flaw through the lens of differential geometry. By analyzing Tweedie's Formula, we reveal that CFG intrinsically performs a tangential linear extrapolation. Because the natural data manifold is highly curved, this uniform linear step introduces a severe orthogonal deviation. To keep the generation trajectory safely bounded, we formulate a theoretical upper bound for spatial and adaptive guidance. Based on these geometric insights, we propose Spatial Adaptive Multi Guidance (SAMG), a training-free and virtually zero-cost sampling algorithm. SAMG dynamically computes point-wise conditional guidance energy, applying a conservative minimum scale to high-energy boundary regions to preserve delicate micro-textures, while deploying an aggressive maximum scale in low-energy regions to maximize semantic injection. Extensive experiments across diverse image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) architectures demonstrate that SAMG effectively resolves the detail-artifact dilemma, achieving superior semantic alignment, structural integrity, and temporal smoothness without any computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SAMG gives a practical spatially varying guidance rule that might help the detail-artifact issue, but the claimed geometric derivation does not appear to justify the specific energy-based scaling.

The main point is that the authors take the known problem with uniform classifier-free guidance and try to fix it with a spatially adaptive scale that uses local conditional guidance energy to choose between min and max values at each pixel. They frame CFG as a tangential linear step on a curved data manifold via Tweedie's formula and differential geometry, then introduce SAMG as a training-free way to apply conservative scales where energy is high and aggressive ones where it is low. That framing and the zero-overhead implementation are the concrete new pieces. If the experiments on SD 1.5, SDXL, SD3.5, CogVideoX and ModelScope actually show better semantic alignment and temporal smoothness without extra cost, the method itself could be worth trying in practice.

The soft spot is exactly where the stress-test note flags it: the paper moves from the geometric upper bound to the point-wise energy rule without a visible inequality or derivation that ties the two together. The energy looks like a normalized difference of conditional and unconditional scores, which is a reasonable heuristic but not obviously the quantity that respects the manifold bound they derive. Without those steps the theoretical justification is weaker than presented. The rest of the paper looks like standard CFG literature plus the new algorithm, with no obvious circularity in the reported results.

This is for people who implement or tune diffusion sampling and want a lightweight adaptive trick. A reader who cares about practical improvements to guidance would get value from the method description and the reported qualitative gains even if the geometry story needs tightening. I would send it to peer review because the idea is concrete, the claims are falsifiable, and the experiments span multiple models.

Referee Report

3 major / 2 minor

Summary. The paper claims that Classifier-Free Guidance (CFG) in diffusion models performs tangential linear extrapolation on a curved data manifold (via Tweedie's formula), causing orthogonal deviations that manifest as the detail-artifact dilemma. It derives a theoretical upper bound on guidance scales from differential geometry to keep trajectories bounded, then introduces Spatial Adaptive Multi Guidance (SAMG): a training-free sampler that computes point-wise conditional guidance energy and applies conservative (min) scales to high-energy regions while using aggressive (max) scales in low-energy regions. Experiments on image (SD 1.5, SDXL, SD3.5) and video (CogVideoX, ModelScope) models reportedly show improved semantic alignment, structure, and temporal consistency at no extra cost.

Significance. If the geometric upper bound is rigorously derived and the adaptive energy rule is shown to respect it, SAMG would offer a principled, zero-overhead improvement to a core sampling technique used across generative vision. The training-free nature and multi-architecture validation are strengths; a sound manifold-based justification could influence guidance design beyond ad-hoc scaling.

major comments (3)

[§3] Geometric analysis: The upper bound on orthogonal deviation is derived from the curvature of the data manifold and Tweedie's formula, but the subsequent definition of 'conditional guidance energy' (used to decide min vs. max scale per pixel) is not shown to be bounded by or equivalent to this quantity. The energy appears to be an empirical score difference without an explicit inequality linking it to the manifold deviation bound, so the claim that SAMG 'keeps trajectories safely bounded' rests on an unverified identification rather than a proven relation.
[§4.1] SAMG algorithm: The adaptive rule applies min_scale to high-energy boundary regions and max_scale to low-energy regions, yet no ablation or sensitivity analysis is provided on how the energy threshold or the specific min/max values (free parameters) affect the bound; if the energy is not monotonically related to deviation magnitude, the rule could violate the theoretical upper bound in some regions.
[Experiments] While qualitative and quantitative results are reported across SD 1.5, SDXL, SD3.5, CogVideoX and ModelScope, the paper does not include a direct comparison of the proposed energy against alternative proxies (e.g., gradient magnitude or reconstruction error) to confirm it is the correct indicator of required scale; without this, superiority over uniform CFG or other adaptive baselines may be overstated.
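The proxy comparison asked for in the last major comment could start from something like the sketch below, which applies one shared two-level rule to two candidate proxies and reports how often they pick the same scale. Everything here is an assumption for illustration: the energy form (squared score gap summed over channels), the finite-difference gradient proxy, the quantile threshold, and the helper names. Reconstruction error is left out because its definition is not given on this page.

```python
import numpy as np

def score_gap_energy(eps_uncond, eps_cond):
    # Assumed form of the guidance energy: squared gap summed over channels.
    return np.sum((eps_cond - eps_uncond) ** 2, axis=0)

def gradient_magnitude(latent):
    # Alternative proxy: finite-difference spatial gradient magnitude of the latent.
    gy, gx = np.gradient(latent, axis=(1, 2))
    return np.sqrt(np.sum(gx ** 2 + gy ** 2, axis=0))

def scale_map(proxy, min_scale, max_scale, quantile=0.8):
    # The same two-level rule is applied to every proxy.
    return np.where(proxy > np.quantile(proxy, quantile), min_scale, max_scale)

def agreement(map_a, map_b):
    # Fraction of locations where two scale maps choose the same scale.
    return float(np.mean(map_a == map_b))

# Random stand-ins for a latent and the two network predictions, shape (C, H, W).
rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 16, 16))
eps_u = rng.normal(size=(4, 16, 16))
eps_c = eps_u + rng.normal(size=(4, 16, 16))

map_energy = scale_map(score_gap_energy(eps_u, eps_c), 3.0, 9.0)
map_grad = scale_map(gradient_magnitude(latent), 3.0, 9.0)
overlap = agreement(map_energy, map_grad)
```

Agreement between maps only shows how differently the proxies behave; deciding which proxy is better still requires the downstream quality metrics.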

minor comments (2)

[§4] Notation for the conditional guidance energy should be introduced with an explicit equation number and its normalization clarified (e.g., whether it is L2-normed or per-channel).
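[Abstract] The abstract describes the method both as 'virtually zero-cost' and as having no computational overhead. Standard CFG already evaluates the unconditional score at each step, so state explicitly what SAMG adds on top of that (for example, the per-point energy computation) and quantify the overhead relative to standard CFG.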

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify important areas where the connection between the geometric analysis and the SAMG algorithm can be made more rigorous, and where additional experimental validation would strengthen the claims. We address each major comment point by point below and will revise the manuscript to incorporate the suggested clarifications and analyses.

Referee: [§3] Geometric analysis: The upper bound on orthogonal deviation is derived from the curvature of the data manifold and Tweedie's formula, but the subsequent definition of 'conditional guidance energy' (used to decide min vs. max scale per pixel) is not shown to be bounded by or equivalent to this quantity. The energy appears to be an empirical score difference without an explicit inequality linking it to the manifold deviation bound, so the claim that SAMG 'keeps trajectories safely bounded' rests on an unverified identification rather than a proven relation.

Authors: We appreciate the referee's precise identification of this gap. Section 3 derives the upper bound on orthogonal deviation from the curvature term in the local approximation of Tweedie's formula and shows that uniform CFG produces deviations orthogonal to the manifold. The conditional guidance energy is introduced as the point-wise score difference, which is the quantity that drives the orthogonal component under the same local linearization. While the manuscript presents this as a direct link, we acknowledge that an explicit inequality establishing that the energy is bounded by (or proportional to) the curvature-derived deviation term was not formally stated. In the revision we will add a short derivation in §3 showing that, under the first-order approximation used for the bound, the energy is monotonically related to the deviation magnitude, thereby justifying that conservative scaling in high-energy regions keeps trajectories within the derived bound.

Revision: yes

Referee: [§4.1] SAMG algorithm: The adaptive rule applies min_scale to high-energy boundary regions and max_scale to low-energy regions, yet no ablation or sensitivity analysis is provided on how the energy threshold or the specific min/max values (free parameters) affect the bound; if the energy is not monotonically related to deviation magnitude, the rule could violate the theoretical upper bound in some regions.

Authors: We agree that the absence of sensitivity analysis leaves the robustness of the min/max rule open to question. The design choice (min scale on high-energy regions, max on low-energy) follows from the geometric observation that high-energy locations are where the orthogonal deviation is largest. To address the concern directly, the revised manuscript will include an ablation subsection in §4.1 that (i) varies the energy threshold and the concrete min/max values over a range, (ii) reports the resulting deviation metrics on both synthetic curved manifolds and real diffusion trajectories, and (iii) verifies that the chosen rule never exceeds the theoretical upper bound derived in §3. We will also add a brief monotonicity check (empirical correlation between energy and measured deviation) to confirm the rule does not inadvertently violate the bound.

Revision: yes

Referee: [Experiments] While qualitative and quantitative results are reported across SD 1.5, SDXL, SD3.5, CogVideoX and ModelScope, the paper does not include a direct comparison of the proposed energy against alternative proxies (e.g., gradient magnitude or reconstruction error) to confirm it is the correct indicator of required scale; without this, superiority over uniform CFG or other adaptive baselines may be overstated.

Authors: We accept that a direct comparison of the conditional guidance energy against other candidate proxies would provide stronger evidence that it is the most appropriate indicator. The current experiments demonstrate consistent gains over uniform CFG and several published adaptive baselines, but they do not benchmark the energy metric itself against alternatives such as gradient magnitude or reconstruction error. In the revised version we will add a targeted comparison in the Experiments section: for each model we will compute three proxies (our energy, gradient magnitude, and reconstruction error), apply the same min/max adaptive rule with each proxy, and report both quantitative metrics (FID, CLIP score, temporal consistency) and qualitative examples. This will allow readers to see which proxy best correlates with reduced artifacts while preserving semantic alignment.

Revision: yes
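The synthetic-manifold check promised in the response to the §4.1 comment can be pictured with a toy case: a circle, where a straight step of length s along the tangent lands exactly sqrt(R² + s²) − R off the curve. The sketch below is an assumption-laden illustration, not the paper's bound: it treats the step length as scale times a stand-in "energy" value, uses hypothetical scales `w_min` and `w_max`, and a hypothetical quantile threshold.

```python
import numpy as np

def tangent_step_deviation(step_length, radius=1.0):
    """Distance off a circle after a straight step along the tangent.

    Starting on a circle of the given radius and moving step_length along the
    tangent lands at distance sqrt(radius**2 + step_length**2) from the centre.
    """
    return np.sqrt(radius ** 2 + step_length ** 2) - radius

def spearman(a, b):
    # Rank correlation as the Pearson correlation of ranks (assumes no ties).
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

rng = np.random.default_rng(0)
energy = rng.uniform(0.0, 1.0, size=200)   # stand-in for the per-location gap magnitude
w_min, w_max = 2.0, 8.0                     # hypothetical guidance scales

# Uniform guidance: every location takes the aggressive step.
uniform_dev = tangent_step_deviation(w_max * energy)

# Two-level rule: conservative step where energy is in the top 20 percent.
adaptive_w = np.where(energy > np.quantile(energy, 0.8), w_min, w_max)
adaptive_dev = tangent_step_deviation(adaptive_w * energy)

rho = spearman(energy, uniform_dev)
```

Here the rank correlation between energy and deviation is 1 by construction, and the two-level rule lowers the worst-case deviation. On real diffusion trajectories the deviation would have to be measured rather than computed in closed form, which is exactly where the monotonicity assumption could fail.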

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's chain proceeds from a standard reference (Tweedie's formula) to an interpretive geometric analysis of CFG as tangential extrapolation, followed by formulation of a theoretical upper bound on orthogonal deviation and a heuristic adaptive rule (point-wise conditional guidance energy) that applies min/max scales spatially. No equation is shown to equal a fitted input by construction, no load-bearing premise reduces to a self-citation whose validity depends on the present work, and the proposed SAMG algorithm is presented as a training-free heuristic motivated by the geometric insight rather than a statistical prediction forced by data reuse. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on a geometric reinterpretation of CFG and an unverified upper bound; no independent evidence or shipped code is supplied in the abstract.

free parameters (2)

min_scale
Conservative guidance scale applied to high-energy boundary regions; value not specified in abstract.
max_scale
Aggressive guidance scale applied to low-energy regions; value not specified in abstract.

axioms (1)

domain assumption: CFG performs tangential linear extrapolation on a highly curved natural data manifold
Invoked via analysis of Tweedie's Formula in the abstract.

pith-pipeline@v0.9.0 · 5583 in / 1278 out tokens · 66517 ms · 2026-05-07T11:02:21.220036+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control
cs.RO 2026-05 unverdicted novelty 5.0

DAJI learns future-aware joint intents from language to enable proactive humanoid control, reporting 94.42% rollout success on HumanML3D-style tasks and 0.152 subsequence FID on BABEL.
