Energy Scaling Laws for Diffusion Models: Quantifying Compute in Image Generation

Aniketh Iyengar; Boris Ruf; Jiaqi Han; Marcin Detyniecki; Stefano Ermon; Vincent Grari

arXiv: 2511.17031 · v2 · pith:TTWPTCLFnew · submitted 2025-11-21 · 💻 cs.LG · cs.CV · cs.CY

Pith reviewed 2026-05-17 21:03 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · cs.CY

keywords diffusion models · energy consumption · scaling laws · GPU inference · image generation · FLOPs · sustainable AI

The pith

An adapted Kaplan scaling law predicts GPU energy use for diffusion models from FLOPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that energy consumption in diffusion model inference follows a predictable scaling relationship with computational complexity measured in FLOPs. By decomposing inference into text encoding, repeated denoising, and decoding stages, the authors test the idea that the many denoising steps drive most of the energy draw. Experiments across four models and three GPUs, varying resolution, precision, step count, and guidance, show the law fits data with R-squared above 0.9 inside each setup and preserves model rankings even on new hardware. A reader would care because the relation supplies a practical way to forecast power needs and carbon costs before running large image generators.

Core claim

The central claim is that GPU energy consumption for diffusion model inference can be predicted from FLOPs using an adaptation of Kaplan scaling laws. The hypothesis that denoising operations dominate because they repeat across multiple steps is supported by measurements showing high predictive accuracy within architectures and strong rank correlations that allow reliable estimates for unseen model-hardware pairs.

What carries the argument

The adapted energy scaling law that relates total GPU energy to FLOPs after isolating the repeated denoising component as the main driver.
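
One consequence of a power-law form can be sketched in a few lines. Assuming the law has the shape E = A · FLOPs^α (the fitted values of A and α are not given in this summary, so `ALPHA` below is a hypothetical placeholder), the ratio of predicted energies between two configurations depends only on the FLOP ratio and the exponent:

```python
def energy_ratio(flops_new: float, flops_ref: float, alpha: float) -> float:
    """Ratio E_new / E_ref under E = A * FLOPs**alpha; the prefactor A cancels."""
    if flops_new <= 0 or flops_ref <= 0:
        raise ValueError("FLOP counts must be positive")
    return (flops_new / flops_ref) ** alpha


# Hypothetical exponent, chosen only for illustration.
ALPHA = 0.9

ratio = energy_ratio(flops_new=2.0e12, flops_ref=1.0e12, alpha=ALPHA)
print(f"Doubling FLOPs scales predicted energy by a factor of {ratio:.3f}")
```

Because A cancels, a relative comparison like this needs only the exponent, while an absolute energy estimate would also need the fitted prefactor.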

If this is right

Energy needs for new resolutions, precisions, or step counts can be estimated without running the model.
Rankings of energy efficiency remain consistent when moving models between different GPU types.
The relation supplies a basis for calculating carbon footprints of image generation workloads.
Diffusion inference behaves as a compute-bound process whose energy follows the same pattern across tested settings.
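
The carbon-footprint step in the list above is a unit conversion once an energy estimate exists. A minimal sketch, assuming energy is given in joules and that a grid carbon intensity in grams of CO2 per kWh is available (the numbers used here are hypothetical, not taken from the paper):

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def carbon_grams(energy_joules: float, grid_g_co2_per_kwh: float) -> float:
    """Convert an energy estimate to grams of CO2 for a given grid intensity."""
    if energy_joules < 0 or grid_g_co2_per_kwh < 0:
        raise ValueError("inputs must be non-negative")
    return energy_joules / JOULES_PER_KWH * grid_g_co2_per_kwh


# Hypothetical workload: 1,000 images at 500 J each on a 400 gCO2/kWh grid.
total_energy = 1_000 * 500.0
print(f"{carbon_grams(total_energy, 400.0):.1f} g CO2")
```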

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The law could be used to simulate energy costs when increasing resolution or step counts in next-generation models.
Similar scaling approaches might apply to other iterative generation tasks such as video or audio synthesis.
Designers could tune step count or precision to lower energy while keeping acceptable image quality.
Quick energy estimates from the law would support regulatory reviews of large-scale AI environmental impact.

Load-bearing premise

Denoising operations dominate energy consumption due to their repeated execution across multiple inference steps.

What would settle it

Measure actual energy draw for a diffusion model on a previously untested GPU and configuration; a large mismatch with the value predicted from FLOPs would disprove the scaling law.
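
The check described above amounts to comparing a prediction with a measurement. A minimal sketch follows; the 20% tolerance is an assumption made for illustration, since the page does not say how large a mismatch would count as disproof:

```python
def relative_error(predicted: float, measured: float) -> float:
    """Absolute relative error of a prediction against a positive measurement."""
    if measured <= 0:
        raise ValueError("measured energy must be positive")
    return abs(predicted - measured) / measured


def is_consistent(predicted: float, measured: float, tolerance: float = 0.2) -> bool:
    """True if the prediction falls within the (assumed) relative tolerance."""
    return relative_error(predicted, measured) <= tolerance


# Hypothetical values in joules for an untested GPU and configuration.
print(is_consistent(predicted=540.0, measured=500.0))
print(is_consistent(predicted=900.0, measured=500.0))
```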

Figures

Figures reproduced from arXiv: 2511.17031 by Aniketh Iyengar, Boris Ruf, Jiaqi Han, Marcin Detyniecki, Stefano Ermon, Vincent Grari.

**Figure 2.** Diagnostic plots show actual versus predicted energy consumption for (a) Flux and (b) …

**Figure 3.** The top row shows training on model pairs: (a) Qwen + SD 3.5, (b) Flux + SD 3.5, and (c) …

**Figure 4.** Diagnostic plots show (a) Stable Diffusion 2 on NVIDIA A100, illustrating U-Net scaling …

**Figure 5.** Cross-architecture experiments demonstrate generalization between U-Net and MMDiT …

**Figure 6.** Cross-architecture+gpu experiments demonstrate generalization between U-Net and …

**Figure 7.** Stable Diffusion 2 FLOP Derivation (per diffusion step). Conventions. Let $h_0 = H/8$, $w_0 = W/8$ be the initial latent dimensions. A multiply–add pair counts as 2 FLOPs. Totals are reported as GFLOPs via $\mathrm{GFLOPs} = \mathrm{FLOPs}/10^9$. All expressions below are single forward-pass costs. The bespoke numbers added are biases that account for discrepancies between our atom-formulas and the actual architecture seen in SD2 (a…

**Figure 8.** Relative compute distribution across models and resolutions. For each model (FLUX, QWEN, and SD 3.5), we plot the proportion of total GFLOPs attributed to the iterative denoising process (scaled to 10 diffusion steps) versus the encoder/decoder overhead.

**Figure 9.** Energy scaling of Flux, relative to smallest energy setting (baseline). Flux has 38 layers in …

**Figure 10.** Energy scaling of SD3-5, relative to smallest energy setting (baseline). SD3-5 has 38 …

**Figure 11.** Energy scaling of SD2, relative to smallest energy setting (baseline). SD2 has 9 main …

**Figure 12.** Energy scaling of Qwen, relative to smallest energy setting (baseline). Qwen has 60 layers …
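
The counting conventions quoted in the Figure 7 caption (latent size H/8 by W/8, one multiply-add = 2 FLOPs, GFLOPs = FLOPs/10^9) can be illustrated on a single convolution. This is a sketch under stated assumptions: the standard stride-1, same-padded convolution count is used, bias terms are ignored, and the channel counts are hypothetical. It is not the paper's per-layer derivation.

```python
def latent_size(height: int, width: int, downsample: int = 8) -> tuple[int, int]:
    """Latent dimensions h0 = H/8, w0 = W/8 for an image of size H x W."""
    return height // downsample, width // downsample


def conv2d_flops(h: int, w: int, c_in: int, c_out: int, kernel: int = 3) -> int:
    """FLOPs of a stride-1, same-padded conv; each multiply-add counts as 2."""
    return 2 * h * w * c_in * c_out * kernel * kernel


def to_gflops(flops: float) -> float:
    """Convert a raw FLOP count to GFLOPs."""
    return flops / 1e9


h0, w0 = latent_size(512, 512)
# Hypothetical layer: 4 input channels, 320 output channels, 3x3 kernel.
flops = conv2d_flops(h0, w0, c_in=4, c_out=320)
print(f"latent {h0}x{w0}: {to_gflops(flops):.4f} GFLOPs")
```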

Original abstract

The rapidly growing computational demands of diffusion models for image generation have raised significant concerns about energy consumption and environmental impact. While existing approaches to energy optimization focus on architectural improvements or hardware acceleration, there is a lack of principled methods to predict energy consumption across different model configurations and hardware setups. We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. We conduct comprehensive experiments across four state-of-the-art diffusion models (Stable Diffusion 2, Stable Diffusion 3.5, Flux, and Qwen) on three GPU architectures (NVIDIA A100, A4000, A6000), spanning various inference configurations including resolution ($256^2$--$1024^2$), precision (fp16/fp32), step counts (10--50), and classifier-free guidance settings. Our energy scaling law achieves high predictive accuracy within individual architectures ($R^2 > 0.9$) and exhibits strong cross-architecture generalization, maintaining high rank correlations across models and enabling reliable energy estimation for unseen model--hardware combinations. These results validate the compute-bound nature of diffusion inference and establish energy consumption estimation as a necessary foundation for sustainable AI deployment planning and subsequent carbon footprint assessment.
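
The decomposition in the abstract suggests a quick way to see why step count matters: text encoding and decoding are paid once, while the denoiser cost is multiplied by the number of steps. A minimal sketch, with hypothetical per-component FLOP counts that are not taken from the paper:

```python
def denoise_share(flops_text: float, flops_denoise: float,
                  flops_decode: float, n_steps: int) -> float:
    """Fraction of total FLOPs spent in the repeated denoising stage."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    denoise = n_steps * flops_denoise
    total = flops_text + denoise + flops_decode
    return denoise / total


# Hypothetical component costs in FLOPs (single pass each).
TEXT, DENOISE, DECODE = 5.0e10, 2.0e12, 1.0e12

for steps in (10, 30, 50):
    share = denoise_share(TEXT, DENOISE, DECODE, steps)
    print(f"{steps:2d} steps: denoising share = {share:.3f}")
```

With these made-up numbers the share rises with step count, which is the qualitative behaviour the hypothesis relies on. Whether energy follows the same split as FLOPs is the empirical question the paper addresses.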

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Adapts scaling laws to energy prediction for diffusion inference with solid within-model fits but rests on an untested claim that denoising dominates consumption.

This paper shows how to adapt scaling laws to predict energy draw for diffusion model inference, and the fits work well enough within the tested setups to be useful for planning. The authors run experiments on Stable Diffusion 2 and 3.5, Flux, and Qwen across A100, A4000, and A6000 GPUs, varying resolution, precision, steps, and guidance. The key move is treating total FLOPs as the driver and assuming denoising steps dominate because they repeat. They report R² > 0.9 within each architecture and keep good rank correlations when crossing models or hardware.

The data collection itself is the solid part. Measuring real energy on that many configurations gives a concrete baseline that earlier work on FLOPs or architecture alone did not have. Where it is thinner is the missing component-level energy traces. Without those, it is hard to know how well the denoising-dominance assumption holds for models with heavy encoders like Flux, especially when step counts are low. The fitting procedure and any post-hoc decisions also aren't spelled out in the abstract, so reproducibility of the exact law is not immediate.

Readers who care about sustainable deployment or carbon accounting for image generation will find this directly applicable. It is not a theoretical advance but a practical measurement study. I would send it to peer review because the experiments address a real gap and the results are falsifiable with more hardware data.

Referee Report

2 major / 2 minor

Summary. The manuscript adapts Kaplan-style scaling laws to predict GPU energy consumption for diffusion-based image generation from FLOPs. It decomposes inference into text encoding, iterative denoising, and decoding, with the hypothesis that denoising dominates due to repeated steps. Experiments cover four models (Stable Diffusion 2, Stable Diffusion 3.5, Flux, Qwen) and three GPUs (A100, A4000, A6000) across resolutions 256²–1024², fp16/fp32, 10–50 steps, and classifier-free guidance. Key claims are within-architecture R² > 0.9 and strong cross-architecture rank correlations enabling energy estimates for unseen model–hardware pairs.

Significance. If the central claims hold, the work supplies a practical tool for forecasting energy use and carbon impact in generative diffusion pipelines, supporting sustainable deployment planning. The breadth of the experimental design—four distinct models, three GPU architectures, and systematic variation of resolution, precision, steps, and guidance—is a clear strength that grounds the reported within- and cross-architecture results.

major comments (2)

[Abstract and §3] The hypothesis that 'denoising operations dominate energy consumption due to their repeated execution across multiple inference steps' is load-bearing for both the within-architecture R² fits and the cross-architecture rank correlations, yet no component-wise energy measurements (e.g., isolating text-encoder, denoiser, and decoder power draw) are reported. Without such validation, especially for large-encoder models like Flux or low step counts, the effective compute-to-energy mapping may be model- and configuration-dependent rather than universal.
[§5 (Results)] The reported R² > 0.9 values are presented without error bars, explicit description of the fitting procedure (ordinary least squares? weighted?), or criteria for data-point inclusion/exclusion. These details are required to evaluate whether the high correlations are robust or sensitive to post-hoc choices, directly affecting confidence in the scaling-law claims.
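
The component-wise measurement requested in the first major comment could, in principle, be done by integrating sampled power over each stage. The sketch below assumes (time in seconds, power in watts) samples have already been collected per stage. How they are collected is outside the sketch, and the trace values are hypothetical.

```python
def energy_joules(samples: list[tuple[float, float]]) -> float:
    """Trapezoidal integral of (time_s, power_w) samples, in joules."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("timestamps must be non-decreasing")
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def energy_by_stage(trace: dict[str, list[tuple[float, float]]]) -> dict[str, float]:
    """Energy per pipeline stage from a dict of per-stage power traces."""
    return {stage: energy_joules(samples) for stage, samples in trace.items()}


# Hypothetical power traces for one generation run.
trace = {
    "text_encode": [(0.0, 120.0), (0.2, 150.0)],
    "denoise": [(0.2, 250.0), (2.2, 260.0), (4.2, 255.0)],
    "decode": [(4.2, 200.0), (4.6, 180.0)],
}
for stage, joules in energy_by_stage(trace).items():
    print(f"{stage:12s} {joules:8.1f} J")
```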

minor comments (2)

[Notation] The notation for the fitted energy scaling coefficients should be introduced with an explicit equation in the main text (rather than only in the appendix) to improve readability.
[Figures] Figure captions for the energy-versus-FLOPs scatter plots should state the number of data points per series and whether any points were excluded.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which help clarify important aspects of our work. We address each major comment below and indicate the revisions we will make to the manuscript.

Referee: [Abstract and §3] The hypothesis that 'denoising operations dominate energy consumption due to their repeated execution across multiple inference steps' is load-bearing for both the within-architecture R² fits and the cross-architecture rank correlations, yet no component-wise energy measurements (e.g., isolating text-encoder, denoiser, and decoder power draw) are reported. Without such validation, especially for large-encoder models like Flux or low step counts, the effective compute-to-energy mapping may be model- and configuration-dependent rather than universal.

Authors: We appreciate this observation. Our experiments vary the number of denoising steps (10–50) while keeping text encoding and decoding fixed for each configuration. The observed strong linear scaling of total energy with step count (reflected in R² > 0.9) provides indirect support for denoising dominance, as a constant contribution from the other components would not produce the proportional increase we measure. For Flux and similar models, the single-pass text encoding remains a fixed overhead that becomes relatively smaller at higher step counts, consistent with our cross-architecture rank correlations. We will revise §3 to explicitly articulate this reasoning, acknowledge the absence of isolated component measurements as a limitation (particularly at low step counts), and note that future work could include direct power profiling. This strengthens the manuscript without requiring new experiments.
Revision: partial
Referee: [§5 (Results)] The reported R² > 0.9 values are presented without error bars, explicit description of the fitting procedure (ordinary least squares? weighted?), or criteria for data-point inclusion/exclusion. These details are required to evaluate whether the high correlations are robust or sensitive to post-hoc choices, directly affecting confidence in the scaling-law claims.

Authors: We agree that these details are necessary for full transparency and reproducibility. The reported fits used ordinary least squares regression on log-log transformed data, following the standard Kaplan-style procedure, with all measured configurations included and no post-hoc exclusions. In the revised §5 we will (i) explicitly state the fitting method, (ii) report the number of data points per model–hardware pair, (iii) include error bars or standard errors on the regression coefficients and R² values, and (iv) add supplementary fit diagnostics such as residual plots or p-values. These additions will allow readers to assess robustness directly.
Revision: yes
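
The procedure described in this response, ordinary least squares on log-log data with a standard error on the exponent, can be sketched with the standard library alone. The data below are synthetic (generated from an assumed E = 2 · FLOPs^0.8) and serve only to exercise the code:

```python
import math


def loglog_ols(flops: list[float], energy: list[float]) -> dict[str, float]:
    """OLS fit of log(E) = log(A) + alpha * log(FLOPs), with R^2 and SE(alpha)."""
    n = len(flops)
    if n < 3 or n != len(energy):
        raise ValueError("need at least 3 paired observations")
    x = [math.log(f) for f in flops]
    y = [math.log(e) for e in energy]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    alpha = sxy / sxx
    log_a = mean_y - alpha * mean_x
    ss_res = sum((yi - (log_a + alpha * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return {
        "alpha": alpha,
        "log_A": log_a,
        "r2": 1.0 - ss_res / ss_tot,
        # Standard error of the slope under the usual OLS assumptions.
        "se_alpha": math.sqrt(ss_res / (n - 2) / sxx),
    }


# Synthetic, noise-free data from an assumed power law.
flops = [1e9, 1e10, 1e11, 1e12]
energy = [2.0 * f ** 0.8 for f in flops]
fit = loglog_ols(flops, energy)
print({name: round(value, 4) for name, value in fit.items()})
```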

Circularity Check

0 steps flagged

No significant circularity in energy scaling law derivation

The paper adapts Kaplan scaling laws empirically by measuring energy and FLOPs across multiple diffusion models (SD2, SD3.5, Flux, Qwen), GPUs (A100/A4000/A6000), resolutions, precisions, step counts, and CFG settings, then fits a relation and reports R² > 0.9 within architectures plus rank correlations for cross-architecture generalization to unseen model-hardware pairs. This is standard data-driven fitting with explicit out-of-sample checks rather than a closed derivation that reduces to its own inputs by construction. The denoising-dominance hypothesis is stated as an assumption and used to motivate the FLOPs-based predictor, but the validation rests on direct total-energy measurements, not on any self-referential loop or self-citation chain. No load-bearing step equates a claimed prediction to a fitted parameter by definition.
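
The rank-correlation check mentioned above asks whether predicted and measured energies order the configurations the same way. A minimal Spearman correlation in pure Python follows (ties receive their average rank). The example values are hypothetical, and the page does not say which rank statistic the paper uses.

```python
import math


def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical predicted vs. measured energies (joules) for five configurations.
predicted = [120.0, 300.0, 450.0, 800.0, 1500.0]
measured = [140.0, 280.0, 520.0, 760.0, 1700.0]
print(f"Spearman rho = {spearman(predicted, measured):.3f}")
```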

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on an empirically fitted power-law relationship between FLOPs and measured energy; no first-principles derivation is offered.

free parameters (1)

energy scaling coefficients
Coefficients in the adapted Kaplan-style power law relating FLOPs to GPU energy, fitted to experimental measurements.

axioms (1)

domain assumption: Energy consumption of diffusion inference is dominated by the iterative denoising steps and scales as a power law with total FLOPs.
Core hypothesis stated in the abstract and used to justify the scaling-law form.

pith-pipeline@v0.9.0 · 5574 in / 1152 out tokens · 36965 ms · 2026-05-17T21:03:21.452941+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (relation: unclear)

We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). ... $\log(E) = \log(A) + \alpha \log(\mathrm{FLOPs} \times 2^{I_{\mathrm{cfg}}}) + \beta_{\mathrm{dtype}} I_{\mathrm{dtype}} + \dots$

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction (relation: unclear)

$\mathrm{FLOPs}_{\mathrm{total}} = \mathrm{FLOPs}_{\mathrm{text}} + N_{\mathrm{steps}} \times \mathrm{FLOPs}_{\mathrm{denoise}} + \mathrm{FLOPs}_{\mathrm{decode}}$
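
The two quoted formulas can be combined into a small predictor. The sketch implements only the terms visible in the quoted passages: the source equation is truncated after the dtype term, so any further terms are omitted here. All coefficient values are hypothetical, and which precision maps to I_dtype = 1 is an assumption (the flag is left generic).

```python
import math


def total_flops(flops_text: float, flops_denoise: float,
                flops_decode: float, n_steps: int) -> float:
    """FLOPs_total = FLOPs_text + N_steps * FLOPs_denoise + FLOPs_decode."""
    return flops_text + n_steps * flops_denoise + flops_decode


def log_energy(flops: float, use_cfg: bool, dtype_flag: bool,
               log_a: float, alpha: float, beta_dtype: float) -> float:
    """log(E) from the visible terms of the quoted log-linear model."""
    i_cfg = 1 if use_cfg else 0        # classifier-free guidance doubles FLOPs
    i_dtype = 1 if dtype_flag else 0   # precision indicator (mapping assumed)
    return log_a + alpha * math.log(flops * 2 ** i_cfg) + beta_dtype * i_dtype


# Hypothetical coefficients and component costs, for illustration only.
COEFFS = {"log_a": -20.0, "alpha": 0.9, "beta_dtype": 0.3}
flops = total_flops(5.0e10, 2.0e12, 1.0e12, n_steps=30)
energy = math.exp(log_energy(flops, use_cfg=True, dtype_flag=False, **COEFFS))
print(f"total FLOPs = {flops:.3e}, predicted energy = {energy:.1f} (arbitrary units)")
```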

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 8 internal anchors

[1] Sam Altman. The Gentle Singularity. https://blog.samaltman.com/the-gentle-singularity. [Accessed 18-10-2025]
[2] Lasse F Wolff Anthony, Benjamin Kanding, and Raghavendra Selvan. Carbontracker: Tracking and predicting the carbon footprint of training deep learning models. arXiv preprint arXiv:2007.03051, 2020. doi:10.48550/arXiv.2007.03051
[3] Benoit Courty, Victor Schmidt, Sasha Luccioni, Goyal-Kamal, MarionCoutarel, Boris Feld, Jérémy Lecourt, LiamConnell, Amine Saboni, Inimaz, supatomic, Mathilde Léval, Luis Blanche, Alexis Cruveiller, ouminasara, Franklin Zhao, Aditya Joshi, Alexis Bogroff, Hugues de Lavoreille, Niko Laskaris, Edoardo Abati, Douglas Blank, Ziyao Wang, Armin Catovic, Marc ... de Araújo, JPW, and MinervaBooks, 2024
[4] Cooper Elsworth, Keguo Huang, David Patterson, Ian Schneider, Robert Sedivy, Savannah Goodman, Ben Townsend, Parthasarathy Ranganathan, Jeff Dean, Amin Vahdat, et al. Measuring the environmental impact of delivering AI at Google Scale. arXiv preprint arXiv:2508.15734, 2025
[5] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024
[6] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024
[7] Alisa Fortin, Guillaume Vernade, Kat Kampf, and Ammaar Reshi. Introducing gemini 2.5 flash image, our state-of-the-art image model. Google Developers Blog, August 2025
[8] Peter Henderson, Jieru Hu, Joshua Romoff, Emma Brunskill, Dan Jurafsky, and Joelle Pineau. Towards the systematic reporting of the energy and carbon footprints of machine learning. Journal of Machine Learning Research, 21(248):1–43, 2020
[9] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020
[10] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022
[11] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre ... 2022
[12] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele Paolo Scarpazza. Dissecting the NVIDIA Volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826, 2018
[13] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015
[15] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022
[16] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, pages 1–22, 2025
[17] Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. Estimating the carbon footprint of bloom, a 176b parameter language model. arXiv preprint arXiv:2211.02001, 2022
[18] Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. Power hungry processing: Watts driving the cost of AI deployment? FAccT ’23: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 85–99, 2023
[19] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. NVIDIA tensor core programmability, performance & precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 522–531. IEEE, 2018
[20] Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306, 2023
[21] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017
[22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021
[23] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023
[24] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021
[25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022
[26] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015
[27] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022
[28] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015
[29] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020
[30] Yang Song et al. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020
[31] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, 2019
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023
[33] Qinsheng Zhang, Molei Tao, and Yongxin Chen. gddim: Generalized denoising diffusion implicit models. arXiv preprint arXiv:2206.05564, 2022

[1] [1]

The Gentle Singularity

Sam Altman. The Gentle Singularity. https://blog.samaltman.com/ the-gentle-singularity. [Accessed 18-10-2025]

work page 2025

[2] [2]

arXiv preprint arXiv:2007.03051 doi:10.48550/arXiv.2007.03051

Lasse F Wolff Anthony, Benjamin Kanding, and Raghavendra Selvan. Carbontracker: Track- ing and predicting the carbon footprint of training deep learning models.arXiv preprint arXiv:2007.03051, 2020

work page arXiv 2007

[3] [3]

de Araújo, JPW, and MinervaBooks

Benoit Courty, Victor Schmidt, Sasha Luccioni, Goyal-Kamal, MarionCoutarel, Boris Feld, Jérémy Lecourt, LiamConnell, Amine Saboni, Inimaz, supatomic, Mathilde Léval, Luis Blanche, Alexis Cruveiller, ouminasara, Franklin Zhao, Aditya Joshi, Alexis Bogroff, Hugues de La- voreille, Niko Laskaris, Edoardo Abati, Douglas Blank, Ziyao Wang, Armin Catovic, Marc ...

work page 2024

[4] [4]

Measuring the environmental impact of delivering AI at Google Scale.arXiv preprint arXiv:2508.15734, 2025

Cooper Elsworth, Keguo Huang, David Patterson, Ian Schneider, Robert Sedivy, Savannah Goodman, Ben Townsend, Parthasarathy Ranganathan, Jeff Dean, Amin Vahdat, et al. Measuring the environmental impact of delivering AI at Google Scale.arXiv preprint arXiv:2508.15734, 2025

work page arXiv 2025

[5] [5]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Scaling rectified flow transformers for high-resolution image synthesis, 2024

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024

work page 2024

[7] [7]

Introducing gemini 2.5 flash image, our state-of-the-art image model

Alisa Fortin, Guillaume Vernade, Kat Kampf, and Ammaar Reshi. Introducing gemini 2.5 flash image, our state-of-the-art image model. Google Developers Blog, August 2025

work page 2025

[8] [8]

Towards the systematic reporting of the energy and carbon footprints of machine learning

Peter Henderson, Jieru Hu, Joshua Romoff, Emma Brunskill, Dan Jurafsky, and Joelle Pineau. Towards the systematic reporting of the energy and carbon footprints of machine learning. Journal of Machine Learning Research, 21(248):1–43, 2020

work page 2020

[9]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

[10]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

[11]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre...

[12]

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele Paolo Scarpazza. Dissecting the NVIDIA Volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826, 2018

[13]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

[14]

Microsoft COCO: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common objects in context, 2015

[15]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022

[16]

Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, pages 1–22, 2025

[17]

Estimating the carbon footprint of bloom, a 176b parameter language model

Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. Estimating the carbon footprint of bloom, a 176b parameter language model. arXiv preprint arXiv:2211.02001, 2022

[18]

Power hungry processing: Watts driving the cost of AI deployment?

Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. Power hungry processing: Watts driving the cost of AI deployment? FAccT ’23: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 85–99, 2023

[19]

NVIDIA tensor core programmability, performance & precision

Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. NVIDIA tensor core programmability, performance & precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 522–531. IEEE, 2018

[20]

On distillation of guided diffusion models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306, 2023

[21]

Mixed Precision Training

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017

[22]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021

[23]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023

[24]

Zero-shot text-to-image generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021

[25]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022

[26]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015

[27]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

[28]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015

[29]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

[30]

Yang Song et al. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

[31]

Energy and policy considerations for deep learning in NLP

Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, 2019

[32]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023

[33]

gddim: Generalized denoising diffusion implicit models

Qinsheng Zhang, Molei Tao, and Yongxin Chen. gddim: Generalized denoising diffusion implicit models. arXiv preprint arXiv:2206.05564, 2022

A Additional Validation Results

A.1 Individual U-Net Architecture Validation

Figure 4 provides detailed individual model validation for Stable Diffusion 2’s U-Net architecture across different GPU platforms. Despite...