hub Canonical reference

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang · 2025 · cs.CV · arXiv 2501.09732

Canonical reference. 80% of citing Pith papers cite this work as background.

22 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 22 citing papers arXiv PDF

abstract

Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how the generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and with the complicated nature of images, combinations of the components in the framework can be specifically chosen to conform with different application scenario.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 baseline 1

citation-polarity summary

background 4 baseline 1

representative citing papers

Do generative video models understand physical principles?

cs.CV · 2025-01-14 · unverdicted · novelty 8.0

Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

Personalizing Text-to-Image Generation to Individual Taste

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

PhaseFlow4D: Physically Constrained 4D Beam Reconstruction via Feedback-Guided Latent Diffusion

physics.acc-ph · 2026-04-04 · unverdicted · novelty 7.0

PhaseFlow4D reconstructs and tracks 4D beam phase space from 2D projections via a latent diffusion model with built-in physics constraints, achieving 11000x speedup over full simulations while following time-varying conditions.

Reflective Flow Sampling Enhancement

cs.CV · 2026-03-06 · unverdicted · novelty 7.0

RF-Sampling enhances flow matching models by implicitly performing gradient ascent on text-image alignment scores via linear textual combinations and flow inversion.

It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models

cs.CV · 2025-12-31 · unverdicted · novelty 7.0

Noise optimization during sampling recovers diversity in mode-collapsed diffusion models while preserving output fidelity.

Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution

cs.CV · 2025-12-29 · unverdicted · novelty 7.0

IAFS is a training-free iterative inference-time scaling framework that uses adaptive frequency-aware particle fusion to resolve the perception-fidelity conflict in diffusion super-resolution models, outperforming prior scaling strategies.

dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models

cs.CV · 2025-12-22 · conditional · novelty 7.0

dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.

Inference-Time Scaling of Diffusion Language Models via Trajectory Refinement

cs.LG · 2025-07-11 · conditional · novelty 7.0

PG-DLM applies particle Gibbs sampling over full trajectories in diffusion language models to enable iterative refinement, yielding higher accuracy on reward-guided generation with theoretical convergence guarantees.

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

cs.CV · 2025-04-29 · unverdicted · novelty 7.0

ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.

LatentBox: Storing AI-Generated Images at Scale via a Latent-First Design

cs.DC · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

LatentBox is a latent-first storage system that cuts persistent storage for AI images by 78.7% while keeping mean and tail latency competitive with traditional pixel storage.

Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on VBench and NarrLV.

EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

EPIC introduces predicate-guided inference-time search that lifts compositional T2I prompt accuracy from 34% to 71% on GenEval2 with 31-81% lower execution costs.

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.

FASTER: Value-Guided Sampling for Fast RL

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

FRIGID: Scaling Diffusion-Based Molecular Generation from Mass Spectra at Training and Inference Time

cs.LG · 2026-04-17 · unverdicted · novelty 6.0

FRIGID scales a diffusion-based model for de novo molecular structure generation from mass spectra, reaching over 18% top-1 accuracy on MassSpecGym and tripling prior bests on NPLIB1 via large unlabeled training and inference-time fragmentation refinement with log-linear compute scaling.

VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

cs.AI · 2026-04-08 · unverdicted · novelty 6.0 · 2 refs

VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and MCTS methods.

RSEdit: Text-Guided Image Editing for Remote Sensing

cs.CV · 2026-03-14 · unverdicted · novelty 6.0

RSEdit adapts off-the-shelf text-to-image models into a collection of editing systems that follow text instructions while keeping geospatial structure intact in remote sensing images.

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

cs.CV · 2025-10-23 · unverdicted · novelty 6.0

RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.

CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

cs.LG · 2026-03-23 · unverdicted · novelty 5.0 · 2 refs

CellFluxRL post-trains the CellFlux model with RL using seven biological reward functions to generate more biologically valid virtual cell images.

The Serial Scaling Hypothesis

cs.LG · 2025-07-16 · unverdicted · novelty 5.0

The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.

citing papers explorer

Showing 22 of 22 citing papers.

Do generative video models understand physical principles? cs.CV · 2025-01-14 · unverdicted · none · ref 71 · internal anchor
Physics-IQ benchmark reveals that generative video models exhibit limited physical understanding unrelated to their visual quality.
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion cs.CV · 2026-05-13 · unverdicted · none · ref 34 · internal anchor
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization cs.CV · 2026-04-26 · unverdicted · none · ref 25 · internal anchor
Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
Personalizing Text-to-Image Generation to Individual Taste cs.CV · 2026-04-08 · unverdicted · none · ref 35 · internal anchor
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
PhaseFlow4D: Physically Constrained 4D Beam Reconstruction via Feedback-Guided Latent Diffusion physics.acc-ph · 2026-04-04 · unverdicted · none · ref 16 · internal anchor
PhaseFlow4D reconstructs and tracks 4D beam phase space from 2D projections via a latent diffusion model with built-in physics constraints, achieving 11000x speedup over full simulations while following time-varying conditions.
Reflective Flow Sampling Enhancement cs.CV · 2026-03-06 · unverdicted · none · ref 56 · internal anchor
RF-Sampling enhances flow matching models by implicitly performing gradient ascent on text-image alignment scores via linear textual combinations and flow inversion.
It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models cs.CV · 2025-12-31 · unverdicted · none · ref 40 · internal anchor
Noise optimization during sampling recovers diversity in mode-collapsed diffusion models while preserving output fidelity.
Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution cs.CV · 2025-12-29 · unverdicted · none · ref 23 · internal anchor
IAFS is a training-free iterative inference-time scaling framework that uses adaptive frequency-aware particle fusion to resolve the perception-fidelity conflict in diffusion super-resolution models, outperforming prior scaling strategies.
dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models cs.CV · 2025-12-22 · conditional · none · ref 17 · internal anchor
dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.
Inference-Time Scaling of Diffusion Language Models via Trajectory Refinement cs.LG · 2025-07-11 · conditional · none · ref 28 · internal anchor
PG-DLM applies particle Gibbs sampling over full trajectories in diffusion language models to enable iterative refinement, yielding higher accuracy on reward-guided generation with theoretical convergence guarantees.
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer cs.CV · 2025-04-29 · unverdicted · none · ref 29 · internal anchor
ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.
LatentBox: Storing AI-Generated Images at Scale via a Latent-First Design cs.DC · 2026-05-19 · unverdicted · none · ref 37 · 2 links · internal anchor
LatentBox is a latent-first storage system that cuts persistent storage for AI images by 78.7% while keeping mean and tail latency competitive with traditional pixel storage.
Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos cs.CV · 2026-05-18 · unverdicted · none · ref 20 · internal anchor
MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on VBench and NarrLV.
EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation cs.CV · 2026-05-12 · unverdicted · none · ref 10 · internal anchor
EPIC introduces predicate-guided inference-time search that lifts compositional T2I prompt accuracy from 34% to 71% on GenEval2 with 31-81% lower execution costs.
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling cs.CV · 2026-05-07 · unverdicted · none · ref 21 · internal anchor
DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.
FASTER: Value-Guided Sampling for Fast RL cs.LG · 2026-04-21 · unverdicted · none · ref 4 · internal anchor
FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
FRIGID: Scaling Diffusion-Based Molecular Generation from Mass Spectra at Training and Inference Time cs.LG · 2026-04-17 · unverdicted · none · ref 8 · internal anchor
FRIGID scales a diffusion-based model for de novo molecular structure generation from mass spectra, reaching over 18% top-1 accuracy on MassSpecGym and tripling prior bests on NPLIB1 via large unlabeled training and inference-time fragmentation refinement with log-linear compute scaling.
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion cs.AI · 2026-04-08 · unverdicted · none · ref 30 · 2 links · internal anchor
VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and MCTS methods.
RSEdit: Text-Guided Image Editing for Remote Sensing cs.CV · 2026-03-14 · unverdicted · none · ref 16 · internal anchor
RSEdit adapts off-the-shelf text-to-image models into a collection of editing systems that follow text instructions while keeping geospatial structure intact in remote sensing images.
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling cs.CV · 2025-10-23 · unverdicted · none · ref 16 · internal anchor
RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning cs.LG · 2026-03-23 · unverdicted · none · ref 29 · 2 links · internal anchor
CellFluxRL post-trains the CellFlux model with RL using seven biological reward functions to generate more biologically valid virtual cell images.
The Serial Scaling Hypothesis cs.LG · 2025-07-16 · unverdicted · none · ref 66 · internal anchor
The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer