Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
Pith reviewed 2026-05-07 11:02 UTC · model grok-4.3
The pith
Point-wise adaptive guidance scales resolve the detail-artifact trade-off in diffusion models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CFG performs a tangential linear extrapolation on the curved natural data manifold; the resulting orthogonal deviation is the source of the detail-artifact dilemma. SAMG derives a geometric upper bound on guidance and replaces the global scalar with a point-wise scale selected by conditional guidance energy: the conservative minimum scale at high-energy boundary regions to protect micro-textures, and the aggressive maximum scale at low-energy regions to maximize semantic injection.
What carries the argument
Point-wise conditional guidance energy that selects between minimum and maximum guidance scales at each location so the sampling trajectory remains bounded on the data manifold.
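The selection mechanism described above can be sketched in a few lines. The energy definition (per-pixel L2 norm of the score difference), the quantile threshold, and the concrete scale values below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def samg_guided_noise(eps_cond, eps_uncond, s_min=3.0, s_max=9.0, quantile=0.8):
    """Sketch of SAMG's point-wise min/max guidance rule (assumed details).

    eps_cond, eps_uncond: conditional / unconditional noise predictions, shape (H, W, C).
    The 'conditional guidance energy' is taken here as the per-pixel L2 norm of the
    score difference -- an assumption; the paper's definition may differ.
    """
    delta = eps_cond - eps_uncond
    energy = np.linalg.norm(delta, axis=-1)              # (H, W) per-pixel energy
    threshold = np.quantile(energy, quantile)            # split high/low-energy regions
    # Conservative scale where energy is high (boundaries, micro-texture);
    # aggressive scale where energy is low (flat regions needing semantic injection).
    scale = np.where(energy >= threshold, s_min, s_max)  # (H, W)
    return eps_uncond + scale[..., None] * delta         # CFG with a spatial scale field
```

With `s_min == s_max` this reduces exactly to standard CFG, which is a useful sanity check when experimenting with the rule.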
Load-bearing premise
The point-wise conditional guidance energy correctly flags which regions need conservative versus aggressive scales, and the derived geometric upper bound actually keeps the trajectory on the manifold.
What would settle it
Generate the same prompts with and without SAMG on a fixed diffusion model, then measure whether SAMG fails to improve (or worsens) both semantic alignment scores and structural/temporal artifact metrics.
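That experiment reduces to a paired A/B comparison over prompts. A minimal bookkeeping sketch follows; the metric names are placeholders (higher CLIP-style alignment is better, lower artifact score is better), and the decision rule is the reviewer's framing, not the paper's.

```python
def samg_resolves_dilemma(paired_scores):
    """Paired check for the settling experiment described above.

    paired_scores: list of dicts with keys 'clip_base', 'clip_samg',
    'artifact_base', 'artifact_samg', one dict per prompt, same model and
    seed for both arms. Returns True only if SAMG improves the mean of
    BOTH metrics; the core claim fails otherwise.
    """
    n = len(paired_scores)
    clip_gain = sum(s['clip_samg'] - s['clip_base'] for s in paired_scores) / n
    artifact_drop = sum(s['artifact_base'] - s['artifact_samg'] for s in paired_scores) / n
    return clip_gain > 0 and artifact_drop > 0
```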
Original abstract
Diffusion models have achieved remarkable success in synthesizing complex static and temporal visuals, a breakthrough largely driven by Classifier-Free Guidance (CFG). However, despite its pivotal role in aligning generated content with textual prompts, standard CFG relies on a globally uniform scalar. This homogeneous amplification traps models in a well-documented "detail-artifact dilemma": low guidance scales fail to inject intricate semantics, while high scales inevitably cause structural degradation, color over-saturation, and temporal inconsistencies in videos. In this paper, we expose the physical root of this flaw through the lens of differential geometry. By analyzing Tweedie's Formula, we reveal that CFG intrinsically performs a tangential linear extrapolation. Because the natural data manifold is highly curved, this uniform linear step introduces a severe orthogonal deviation. To keep the generation trajectory safely bounded, we formulate a theoretical upper bound for spatial and adaptive guidance. Based on these geometric insights, we propose Spatial Adaptive Multi Guidance (SAMG), a training-free and virtually zero-cost sampling algorithm. SAMG dynamically computes point-wise conditional guidance energy, applying a conservative minimum scale to high-energy boundary regions to preserve delicate micro-textures, while deploying an aggressive maximum scale in low-energy regions to maximize semantic injection. Extensive experiments across diverse image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) architectures demonstrate that SAMG effectively resolves the detail-artifact dilemma, achieving superior semantic alignment, structural integrity, and temporal smoothness without any computational overhead.
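The guidance update the abstract describes can be written out explicitly. The notation below is reconstructed from standard CFG and Tweedie conventions, not taken from the paper.

```latex
% Standard CFG: one global scalar w applied everywhere
\tilde{\epsilon}_\theta(x_t, c) \;=\; \epsilon_\theta(x_t) \;+\; w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t)\bigr)

% Tweedie's formula: the score determines the posterior mean of the clean image,
% so a linear step in the score is a linear (tangential) step in \hat{x}_0
\hat{x}_0(x_t) \;=\; \frac{x_t + \sigma_t^2\, \nabla_{x_t} \log p_t(x_t)}{\alpha_t}

% SAMG as described: a per-location scale field taking one of two values,
% chosen by thresholding the point-wise conditional guidance energy E(p)
w(p) \;=\; \begin{cases} w_{\min}, & E(p)\ \text{high (boundary regions)} \\[2pt] w_{\max}, & E(p)\ \text{low (flat regions)} \end{cases}
```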
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Classifier-Free Guidance (CFG) in diffusion models performs tangential linear extrapolation on a curved data manifold (via Tweedie's formula), causing orthogonal deviations that manifest as the detail-artifact dilemma. It derives a theoretical upper bound on guidance scales from differential geometry to keep trajectories bounded, then introduces Spatial Adaptive Multi Guidance (SAMG): a training-free sampler that computes point-wise conditional guidance energy and applies conservative (min) scales to high-energy regions while using aggressive (max) scales in low-energy regions. Experiments on image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) models reportedly show improved semantic alignment, structure, and temporal consistency at no extra cost.
Significance. If the geometric upper bound is rigorously derived and the adaptive energy rule is shown to respect it, SAMG would offer a principled, zero-overhead improvement to a core sampling technique used across generative vision. The training-free nature and multi-architecture validation are strengths; a sound manifold-based justification could influence guidance design beyond ad-hoc scaling.
major comments (3)
- [§3] §3 (geometric analysis): The upper bound on orthogonal deviation is derived from the curvature of the data manifold and Tweedie's formula, but the subsequent definition of 'conditional guidance energy' (used to decide min vs. max scale per pixel) is not shown to be bounded by or equivalent to this quantity. The energy appears to be an empirical score difference without an explicit inequality linking it to the manifold deviation bound, so the claim that SAMG 'keeps trajectories safely bounded' rests on an unverified identification rather than a proven relation.
- [§4.1] §4.1 (SAMG algorithm): The adaptive rule applies min_scale to high-energy boundary regions and max_scale to low-energy regions, yet no ablation or sensitivity analysis is provided on how the energy threshold or the specific min/max values (free parameters) affect the bound; if the energy is not monotonically related to deviation magnitude, the rule could violate the theoretical upper bound in some regions.
- [Experiments] Experiments section: While qualitative and quantitative results are reported across SD 1.5, SDXL, SD3.5, CogVideoX and ModelScope, the paper does not include a direct comparison of the proposed energy against alternative proxies (e.g., gradient magnitude or reconstruction error) to confirm it is the correct indicator of required scale; without this, superiority over uniform CFG or other adaptive baselines may be overstated.
minor comments (2)
- [§4] Notation for the conditional guidance energy should be introduced with an explicit equation number and its normalization clarified (e.g., whether it is L2-normed or per-channel).
- [Abstract] The abstract states 'virtually zero-cost' but the method requires an extra forward pass for the unconditional score at each step; clarify the exact overhead relative to standard CFG.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments identify important areas where the connection between the geometric analysis and the SAMG algorithm can be made more rigorous, and where additional experimental validation would strengthen the claims. We address each major comment point by point below and will revise the manuscript to incorporate the suggested clarifications and analyses.
Point-by-point responses
-
Referee: [§3] §3 (geometric analysis): The upper bound on orthogonal deviation is derived from the curvature of the data manifold and Tweedie's formula, but the subsequent definition of 'conditional guidance energy' (used to decide min vs. max scale per pixel) is not shown to be bounded by or equivalent to this quantity. The energy appears to be an empirical score difference without an explicit inequality linking it to the manifold deviation bound, so the claim that SAMG 'keeps trajectories safely bounded' rests on an unverified identification rather than a proven relation.
Authors: We appreciate the referee's precise identification of this gap. Section 3 derives the upper bound on orthogonal deviation from the curvature term in the local approximation of Tweedie's formula and shows that uniform CFG produces deviations orthogonal to the manifold. The conditional guidance energy is introduced as the point-wise score difference, which is the quantity that drives the orthogonal component under the same local linearization. While the manuscript presents this as a direct link, we acknowledge that an explicit inequality establishing that the energy is bounded by (or proportional to) the curvature-derived deviation term was not formally stated. In the revision we will add a short derivation in §3 showing that, under the first-order approximation used for the bound, the energy is monotonically related to the deviation magnitude, thereby justifying that conservative scaling in high-energy regions keeps trajectories within the derived bound. revision: yes
-
Referee: [§4.1] §4.1 (SAMG algorithm): The adaptive rule applies min_scale to high-energy boundary regions and max_scale to low-energy regions, yet no ablation or sensitivity analysis is provided on how the energy threshold or the specific min/max values (free parameters) affect the bound; if the energy is not monotonically related to deviation magnitude, the rule could violate the theoretical upper bound in some regions.
Authors: We agree that the absence of sensitivity analysis leaves the robustness of the min/max rule open to question. The design choice (min scale on high-energy regions, max on low-energy) follows from the geometric observation that high-energy locations are where the orthogonal deviation is largest. To address the concern directly, the revised manuscript will include an ablation subsection in §4.1 that (i) varies the energy threshold and the concrete min/max values over a range, (ii) reports the resulting deviation metrics on both synthetic curved manifolds and real diffusion trajectories, and (iii) verifies that the chosen rule never exceeds the theoretical upper bound derived in §3. We will also add a brief monotonicity check (empirical correlation between energy and measured deviation) to confirm the rule does not inadvertently violate the bound. revision: yes
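The monotonicity check proposed in this response is just a rank correlation between per-region energy and measured deviation. A stdlib-only sketch follows (tie handling is omitted for brevity; the function and variable names are illustrative):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation between two equal-length sequences.

    Intended use, per the response above: xs = per-region conditional
    guidance energies, ys = measured orthogonal deviation magnitudes;
    rho near +1 supports the monotonicity assumption behind the
    min/max rule. Ties are not averaged, so this is a sketch only.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```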
-
Referee: [Experiments] Experiments section: While qualitative and quantitative results are reported across SD 1.5, SDXL, SD3.5, CogVideoX and ModelScope, the paper does not include a direct comparison of the proposed energy against alternative proxies (e.g., gradient magnitude or reconstruction error) to confirm it is the correct indicator of required scale; without this, superiority over uniform CFG or other adaptive baselines may be overstated.
Authors: We accept that a direct comparison of the conditional guidance energy against other candidate proxies would provide stronger evidence that it is the most appropriate indicator. The current experiments demonstrate consistent gains over uniform CFG and several published adaptive baselines, but they do not benchmark the energy metric itself against alternatives such as gradient magnitude or reconstruction error. In the revised version we will add a targeted comparison in the Experiments section: for each model we will compute three proxies (our energy, gradient magnitude, and reconstruction error), apply the same min/max adaptive rule with each proxy, and report both quantitative metrics (FID, CLIP score, temporal consistency) and qualitative examples. This will allow readers to see which proxy best correlates with reduced artifacts while preserving semantic alignment. revision: yes
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper's chain proceeds from a standard reference (Tweedie's formula) to an interpretive geometric analysis of CFG as tangential extrapolation, followed by formulation of a theoretical upper bound on orthogonal deviation and a heuristic adaptive rule (point-wise conditional guidance energy) that applies min/max scales spatially. No equation is shown to equal a fitted input by construction, no load-bearing premise reduces to a self-citation whose validity depends on the present work, and the proposed SAMG algorithm is presented as a training-free heuristic motivated by the geometric insight rather than a statistical prediction forced by data reuse. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- min_scale
- max_scale
axioms (1)
- Domain assumption: CFG performs tangential linear extrapolation on a highly curved natural data manifold.
Forward citations
Cited by 1 Pith paper
-
Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control
DAJI learns future-aware joint intents from language to enable proactive humanoid control, reporting 94.42% rollout success on HumanML3D-style tasks and 0.152 subsequence FID on BABEL.
Reference graph
Works this paper leans on
- [1] Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, and Seungryong Kim. Self-rectifying diffusion sampling with perturbed-attention guidance. arXiv preprint arXiv:2403.17377, 2024.
- [2] Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-DPM: An analytic estimate of the optimal reverse variance in diffusion probabilistic models, 2022. URL https://arxiv.org/abs/2201.06503.
- [3] Valentin De Bortoli, Emile Mathieu, Michael Hutchinson, James Thornton, Yee Whye Teh, and Arnaud Doucet. Riemannian score-based generative modelling, 2022. URL https://arxiv.org/abs/2202.02763.
- [4] Hyungjin Chung, Jeongsol Kim, Geon Yeong Park, Hyelin Nam, and Jong Chul Ye. CFG++: Manifold-constrained classifier-free guidance for diffusion models, 2024. URL https://arxiv.org/abs/2406.08070.
- [5] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.03206.
- [6] Weichen Fan, Amber Yijia Zheng, Raymond A. Yeh, and Ziwei Liu. CFG-Zero*: Improved classifier-free guidance for flow matching models, 2025. URL https://arxiv.org/abs/2503.18886.
- [7] Dhruba Ghosh, Hanna Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment, 2023. URL https://arxiv.org/abs/2310.11513.
- [8] I. J. Good. Introduction to Robbins (1955) An Empirical Bayes Approach to Statistics, pages 379–387. Springer New York, New York, NY, 1992. ISBN 978-1-4612-0919-5. doi: 10.1007/978-1-4612-0919-5_25. URL https://doi.org/10.1007/978-1-4612-0919-5_25.
- [9] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URL https://arxiv.org/abs/2207.12598.
- [10] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. URL https://arxiv.org/abs/2006.11239.
- [11] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models, 2022. URL https://arxiv.org/abs/2206.00364.
- [12] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation, 2023. URL https://arxiv.org/abs/2305.01569.
- [13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d...
- [14] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed, 2024. URL https://arxiv.org/abs/2305.08891.
- [15] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common objects in context, 2015. URL https://arxiv.org/abs/1405.0312.
- [16] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023. URL https://arxiv.org/abs/2210.02747.
- [17] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. URL https://arxiv.org/abs/2209.03003.
- [18] Jim Nilsson and Tomas Akenine-Möller. Understanding SSIM, 2020. URL https://arxiv.org/abs/2006.13846.
- [19] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis, 2023. URL https://arxiv.org/abs/2307.01952.
- [20] Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning, 2021. URL https://arxiv.org/abs/2104.08894.
- [21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020.
- [22] Martin Raphan and Eero P. Simoncelli. Least squares estimation without priors or supervision. Neural Computation, 23(2):374–420, 2011. doi: 10.1162/NECO_a_00076.
- [23] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. URL https://arxiv.org/abs/2112.10752.
- [24] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022. URL https://arxiv.org/abs/2205.11487.
- [25] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models, 2022.
- [26] Maximilian Seitzer. pytorch-fid: FID Score for PyTorch. https://github.com/mseitzer/pytorch-fid, August 2020. Version 0.3.0.
- [27] Dazhong Shen, Guanglu Song, Zeyue Xue, Fu-Yun Wang, and Yu Liu. Rethinking the spatial inconsistency in classifier-free diffusion guidance, 2024. URL https://arxiv.org/abs/2404.05384.
- [28] Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. FreeU: Free lunch in diffusion U-Net.
- [29]
- [30] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022. URL https://arxiv.org/abs/2010.02502.
- [31] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation, 2019. URL https://arxiv.org/abs/1905.07088.
- [32] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021. URL https://arxiv.org/abs/2011.13456.
- [33] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
- [34] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
- [35] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation.
- [36]
- [37] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
- [38] Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Rui-Jie Zhu, Xinhua Cheng, Jiebo Luo, and Li Yuan. ChronoMagic-Bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation. Advances in Neural Information Processing Systems, 37:21236–21270, 2024.
- [39] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.