pith. machine review for the scientific record.

arxiv: 2104.10350 · v3 · submitted 2021-04-21 · 💻 cs.LG · cs.CY

Recognition: 3 theorem links

Carbon Emissions and Large Neural Network Training

Pith reviewed 2026-05-11 23:43 UTC · model grok-4.3

classification 💻 cs.LG cs.CY
keywords: carbon emissions, neural network training, energy efficiency, machine learning, CO2 footprint, datacenters, accelerators, sparse models

The pith

Choices in neural network design, training location, and hardware can reduce the carbon footprint of large AI models by up to 1000 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper calculates the energy consumption and carbon emissions from training recent large models including T5, Meena, GShard, Switch Transformer, and GPT-3, while updating earlier figures for neural architecture search. It shows that sparsely activated networks use less than one tenth the energy of dense networks for the same accuracy level. Differences in the share of carbon-free electricity across locations create five- to tenfold variations in emissions even inside one country. Cloud data centers and specialized accelerators add further efficiency gains of several times. These factors multiply to allow overall reductions reaching two to three orders of magnitude, leading the authors to recommend that energy use and CO2e be reported as standard evaluation metrics alongside accuracy.

Core claim

Calculations for several recent large models show that large but sparsely activated DNNs consume less than one tenth the energy of large dense DNNs without loss of accuracy. Geographic location changes the fraction of carbon-free energy and resulting CO2e by factors of five to ten. Cloud data centers are 1.4 to 2 times more energy efficient than typical facilities, and ML-oriented accelerators inside them are 2 to 5 times more effective than general-purpose systems. The combined choice of DNN architecture, datacenter, and processor therefore allows carbon footprint reductions up to 100-1000 times.
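The arithmetic behind the combined range can be sketched as follows. This is a minimal illustration, not the paper's own calculation: it assumes the four factors are independent and simply multiply, and it reads "less than one tenth the energy" as a 10x reduction at both ends of the range. The names `FACTORS` and `combined_range` are hypothetical.

```python
from math import prod

# Per-factor reduction ranges (low, high) as quoted in the core claim.
# Assumption of this sketch: "less than one tenth the energy" is treated
# as a flat 10x reduction for sparse versus dense DNNs.
FACTORS = {
    "sparse_vs_dense_dnn": (10.0, 10.0),
    "location": (5.0, 10.0),
    "cloud_datacenter": (1.4, 2.0),
    "ml_accelerator": (2.0, 5.0),
}


def combined_range(factors):
    """Multiply the low ends and the high ends of independent factors."""
    low = prod(lo for lo, _ in factors.values())
    high = prod(hi for _, hi in factors.values())
    return low, high


low, high = combined_range(FACTORS)
print(f"combined reduction: ~{low:.0f}x to ~{high:.0f}x")
```

Under these assumptions the low ends multiply to about 140x and the high ends to 1000x, which is consistent with the quoted range of roughly 100-1000 times.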

What carries the argument

The side-by-side energy and CO2e calculations across dense versus sparse DNN architectures, different geographic carbon intensities, datacenter infrastructure levels, and general versus ML-specific processors.

If this is right

  • Sparsely activated models achieve comparable accuracy while using under one tenth the energy of equivalent dense models.
  • Scheduling workloads in locations with higher carbon-free energy shares reduces emissions by factors of five to ten.
  • Cloud data centers combined with ML accelerators improve energy efficiency by roughly three to ten times over standard setups.
  • Reporting energy consumption and CO2e in papers on large-scale training prevents inaccurate later estimates.
  • Adding energy usage to benchmarks such as MLPerf would make efficiency a primary evaluation criterion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Workload schedulers could move large training jobs to low-carbon periods and regions in real time to cut emissions without altering model code.
  • The large variability implies that past aggregate estimates of AI's total carbon impact may require downward revision when actual training conditions are taken into account.
  • Model selection processes may begin to treat energy efficiency as a first-class constraint alongside accuracy and speed.
  • Similar calculations could be applied to inference workloads, affecting choices about where and how deployed models run.

Load-bearing premise

The carbon intensity figures for specific datacenters and the power draw estimates for accelerators and systems are taken as accurate without independent verification.

What would settle it

Direct metering of electricity consumption during training of a model such as Switch Transformer or GPT-3 at two locations with documented carbon intensities, followed by comparison of the measured emission ratio against the predicted five- to tenfold geographic difference.

Original abstract

The computation demand for machine learning (ML) has grown rapidly recently, which comes with a number of costs. Estimating the energy cost helps measure its environmental impact and find greener strategies, yet it is challenging without detailed information. We calculate the energy use and carbon footprint of several recent large models (T5, Meena, GShard, Switch Transformer, and GPT-3) and refine earlier estimates for the neural architecture search that found Evolved Transformer. We highlight the following opportunities to improve energy efficiency and CO2 equivalent emissions (CO2e): Large but sparsely activated DNNs can consume <1/10th the energy of large, dense DNNs without sacrificing accuracy despite using as many or even more parameters. Geographic location matters for ML workload scheduling since the fraction of carbon-free energy and resulting CO2e vary ~5X-10X, even within the same country and the same organization. We are now optimizing where and when large models are trained. Specific datacenter infrastructure matters, as Cloud datacenters can be ~1.4-2X more energy efficient than typical datacenters, and the ML-oriented accelerators inside them can be ~2-5X more effective than off-the-shelf systems. Remarkably, the choice of DNN, datacenter, and processor can reduce the carbon footprint up to ~100-1000X. These large factors also make retroactive estimates of energy cost difficult. To avoid miscalculations, we believe ML papers requiring large computational resources should make energy consumption and CO2e explicit when practical. We are working to be more transparent about energy use and CO2e in our future research. To help reduce the carbon footprint of ML, we believe energy usage and CO2e should be a key metric in evaluating models, and we are collaborating with MLPerf developers to include energy usage during training and inference in this industry standard benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper estimates the energy consumption and carbon footprint of several large neural network models including T5, Meena, GShard, Switch Transformer, and GPT-3, while refining earlier estimates for the Evolved Transformer found via neural architecture search. It identifies four main opportunities for reducing CO2e: sparsely activated DNNs consuming <1/10 the energy of dense models, geographic location affecting carbon intensity by 5-10X even within the same country, Cloud datacenters being 1.4-2X more efficient than typical ones, and ML accelerators being 2-5X more effective than off-the-shelf hardware. These factors are multiplied to claim potential carbon footprint reductions of up to ~100-1000X. The authors advocate making energy consumption and CO2e explicit in ML papers and incorporating energy metrics into benchmarks such as MLPerf.

Significance. If the estimates and multiplicative factors are robust, the work is significant for quantifying the environmental costs of scaling ML and for outlining concrete, high-impact mitigation strategies based on model architecture, scheduling location, infrastructure, and hardware. The explicit numerical baselines for multiple recent models and the call for energy to become a standard evaluation metric alongside accuracy provide a useful reference point for the community and could encourage more reproducible reporting practices.

major comments (3)
  1. Abstract: the headline claim that DNN/datacenter/processor choice yields up to ~100-1000X lower CO2e is obtained by multiplying four independent factors (sparsity <1/10, location 5-10X, datacenter 1.4-2X, accelerator 2-5X); the manuscript provides no sensitivity analysis or error bounds showing how plausible 2-3X variations in any single input (carbon intensity or power-draw model) would affect the upper end of the reported range.
  2. Sections detailing per-model energy calculations (T5, Meena, GShard, Switch, GPT-3): the baseline energy figures are derived from hardware specifications, assumed utilization rates, and training durations without primary measurement data, cross-validation, or explicit exclusion rules, which directly scales all subsequent relative savings and the 100-1000X claim.
  3. Geographic location and datacenter efficiency paragraphs: the 5-10X carbon-free energy variation and 1.4-2X datacenter efficiency gains are stated without citing the specific carbon-intensity tables, PUE values, or regional grid data sources used, leaving the load-bearing numerical inputs un-auditable.
minor comments (3)
  1. Abstract: the forward-looking statement 'we are now optimizing where and when large models are trained' lacks any accompanying details on methodology or preliminary results.
  2. Throughout: several efficiency ranges (1.4-2X, 2-5X) would benefit from explicit references to the supporting studies or internal measurements.
  3. Final paragraph: the proposal to add energy metrics to MLPerf could be strengthened by discussing how consistent measurement protocols would be defined across heterogeneous hardware.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, providing our responses and indicating where revisions have been made to improve clarity, transparency, and auditability.

Point-by-point responses
  1. Referee: Abstract: the headline claim that DNN/datacenter/processor choice yields up to ~100-1000X lower CO2e is obtained by multiplying four independent factors (sparsity <1/10, location 5-10X, datacenter 1.4-2X, accelerator 2-5X); the manuscript provides no sensitivity analysis or error bounds showing how plausible 2-3X variations in any single input (carbon intensity or power-draw model) would affect the upper end of the reported range.

    Authors: The ~100-1000X range illustrates the cumulative potential obtained by multiplying the upper ends of each independent factor (sparsity savings, location variation, datacenter efficiency, and accelerator gains), each drawn from observed ranges in practice. These are presented as separate opportunities rather than a single combined scenario. We agree that noting the impact of input variations would strengthen the presentation. In the revised manuscript, we have added a sentence in the abstract and a short paragraph in the discussion clarifying that the upper bound is illustrative, that the factors are multiplicative and independent, and that even with 2-3X uncertainty in any one input the order-of-magnitude potential remains substantial. revision: partial

  2. Referee: Sections detailing per-model energy calculations (T5, Meena, GShard, Switch, GPT-3): the baseline energy figures are derived from hardware specifications, assumed utilization rates, and training durations without primary measurement data, cross-validation, or explicit exclusion rules, which directly scales all subsequent relative savings and the 100-1000X claim.

    Authors: The baseline energy figures are retrospective estimates constructed from publicly reported training durations, hardware power specifications, and typical utilization rates (e.g., 30-50% for large-scale training) as documented in the source papers for each model. Primary measurement data from the original training runs is not available to us, as the models were developed by multiple organizations. We have expanded the methods and appendix sections in the revision to list the exact sources, assumptions, and any exclusion criteria used for each model, improving traceability while preserving the original estimates. revision: yes

  3. Referee: Geographic location and datacenter efficiency paragraphs: the 5-10X carbon-free energy variation and 1.4-2X datacenter efficiency gains are stated without citing the specific carbon-intensity tables, PUE values, or regional grid data sources used, leaving the load-bearing numerical inputs un-auditable.

    Authors: We appreciate this observation. In the revised manuscript we have inserted explicit citations for the carbon-intensity ranges (drawing on regional grid data from electricityMap and U.S. EPA reports showing 5-10X differences even within countries) and for datacenter PUE values (citing industry reports from Google, Microsoft, and the Uptime Institute documenting Cloud PUEs of ~1.1-1.4 versus typical values of 1.5-2.0). These additions render the numerical inputs fully auditable. revision: yes
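The sensitivity question raised in major comment 1 can be explored with a short sketch. It is an illustration under stated assumptions, not an analysis from the paper: the upper ends of the four factors are taken from the referee's summary, the factors are treated as independent and multiplicative, and the helper name `upper_bound_with_error` is hypothetical.

```python
from math import prod

# Upper ends of the four reduction factors quoted in the referee report.
# Assumption: "<1/10 the energy" is read as a 10x reduction.
UPPER_ENDS = {
    "sparsity": 10.0,
    "location": 10.0,
    "datacenter": 2.0,
    "accelerator": 5.0,
}


def upper_bound_with_error(upper_ends, factor, overstatement):
    """Combined upper bound if one factor is overstated by `overstatement`x."""
    adjusted = dict(upper_ends)
    adjusted[factor] = upper_ends[factor] / overstatement
    return prod(adjusted.values())


nominal = prod(UPPER_ENDS.values())
scenarios = {
    (name, k): upper_bound_with_error(UPPER_ENDS, name, k)
    for name in UPPER_ENDS
    for k in (2.0, 3.0)
}
for (name, k), bound in scenarios.items():
    print(f"{name} overstated {k:.0f}x: upper bound ~{bound:.0f}x (nominal {nominal:.0f}x)")
```

Because the factors multiply, an error of k in any single input scales the combined upper bound by 1/k regardless of which input it is: a 2x overstatement lowers the upper end from 1000x to 500x, and a 3x overstatement lowers it to about 333x.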

Circularity Check

0 steps flagged

No circularity: estimates are direct calculations from external hardware and grid inputs

Full rationale

The paper computes energy use and CO2e for models including T5, Meena, GShard, Switch Transformer, and GPT-3 using stated training durations, utilization rates, power draws, and regional carbon-intensity values as direct inputs. The 100-1000X reduction range is obtained by multiplying independent factors (location 5-10X, datacenter 1.4-2X, accelerator 2-5X, sparsity <1/10) drawn from hardware comparisons and grid data rather than any fitted parameter or self-referential definition. No equation or claim reduces a reported result to its own inputs by construction, and the derivation chain remains self-contained against the provided external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The estimates rest on standard hardware power models and regional carbon-intensity tables; no new entities are postulated and only a small number of scaling assumptions are required.

free parameters (2)
  • PUE (power usage effectiveness)
    Typical datacenter overhead factor applied to compute energy; value not stated in abstract but required for final CO2e.
  • carbon intensity per kWh
    Grid-specific values that vary by location and time; these are external data rather than fitted inside the paper.
axioms (1)
  • Domain assumption: published hardware specifications and prior energy models accurately reflect actual training power draw
    Invoked when converting FLOPs or runtime to kWh.
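How the two free parameters enter the conversion from runtime to CO2e can be sketched as follows. The input figures are the ones quoted in the Reference graph excerpts for the Evolved Transformer search (32,633 TPU v2 hours, 208 W average power, PUE 1.10, 0.431 CO2e/KWh). The function names are hypothetical, and the sketch makes two assumptions the excerpts do not spell out: the 208 W figure applies per TPU v2 device-hour, and the carbon intensity is in kg CO2e per kWh.

```python
def training_energy_kwh(device_hours, avg_power_watts, pue):
    """Datacenter-level energy: device energy scaled by PUE, in kWh."""
    return device_hours * avg_power_watts * pue / 1000.0


def co2e_kg(energy_kwh, kg_co2e_per_kwh):
    """Emissions from energy and grid carbon intensity, in kg CO2e."""
    return energy_kwh * kg_co2e_per_kwh


# Figures quoted in the Reference graph excerpts.
# Assumptions: 208 W is per TPU v2 device; 0.431 is kg CO2e per kWh.
energy = training_energy_kwh(device_hours=32_633, avg_power_watts=208.0, pue=1.10)
emissions = co2e_kg(energy, kg_co2e_per_kwh=0.431)

print(f"energy: {energy:,.0f} kWh")
print(f"emissions: {emissions:,.0f} kg CO2e")
```

Under these assumptions the search comes to roughly 7,500 kWh and about 3,200 kg CO2e. Both PUE and carbon intensity enter as plain multipliers, so an error in either one carries through proportionally to the final CO2e figure.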

pith-pipeline@v0.9.0 · 5664 in / 1245 out tokens · 40982 ms · 2026-05-11T23:43:06.668214+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced contradicts

    CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

    the choice of DNN, datacenter, and processor can reduce the carbon footprint up to ~100-1000X. These large factors also make retroactive estimates of energy cost difficult.

  • IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel contradicts

    CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

    Geographic location matters... fraction of carbon-free energy and resulting CO2e vary ~5X-10X... Specific datacenter infrastructure matters, as Cloud datacenters can be ~1.4-2X more energy efficient... ML-oriented accelerators... ~2-5X more effective

  • IndisputableMonolith.Foundation.PhiForcing phi_equation echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Large but sparsely activated DNNs can consume <1/10th the energy of large, dense DNNs without sacrificing accuracy despite using as many or even more parameters

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. An Amortized Efficiency Threshold for Comparing Neural and Heuristic Solvers in Combinatorial Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    The paper introduces the Amortized Efficiency Threshold (AET) to identify the deployment volume at which neural combinatorial optimization solvers become more energy-efficient overall than heuristic baselines after am...

  2. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

  3. Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

    cs.AI 2026-05 unverdicted novelty 7.0

    TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correc...

  4. SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

    cs.LG 2026-04 unverdicted novelty 7.0

    SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

  5. Training single-electron and single-photon stochastic physical neural networks

    quant-ph 2026-04 unverdicted novelty 7.0

    Single-electron and single-photon stochastic physical neural networks achieve over 97% MNIST test accuracy when trained with empirical outputs in the backward pass using few trials per layer.

  6. The Phase Is the Gradient: Equilibrium Propagation for Frequency Learning in Kuramoto Networks

    cs.LG 2026-04 unverdicted novelty 7.0

    In Kuramoto networks at equilibrium, weak nudging makes phase displacement the exact gradient of loss w.r.t. natural frequencies, enabling frequency learning that beats weight learning and resolves convergence via spe...

  7. Segment Anything

    cs.CV 2023-04 unverdicted novelty 7.0

    A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

  8. Mass-Editing Memory in a Transformer

    cs.CL 2022-10 conditional novelty 7.0

    MEMIT scales direct memory editing in transformers from single facts to thousands of associations by optimizing MLP weight updates.

  9. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  10. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  11. Multitask Prompted Training Enables Zero-Shot Task Generalization

    cs.LG 2021-10 conditional novelty 7.0

    Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

  12. Recasting AI Data Centers as Engines for Carbon Removal

    math.OC 2026-05 unverdicted novelty 6.0

    AI data center waste heat upgraded by heat pumps can drive direct air capture to achieve net CO2 removal and offset operational emissions in several US states under current and 2030 scenarios.

  13. Language-Conditioned Visual Grounding with CLIP Multilingual

    cs.CL 2026-05 unverdicted novelty 6.0

    Fixing the visual encoder in multilingual CLIP isolates text-branch deficits as the cause of lower visual grounding performance for low-resource languages, with model scaling widening some gaps but not others.

  14. A Hardware-aware Hopfield Network with a Nonlinear Memristor Array for Robust Associative Memory with Superlinear Capacity

    cond-mat.dis-nn 2026-05 unverdicted novelty 6.0

    A memristor-array Hopfield network uses device nonlinearity to exceed classical memory capacity with K ~ 0.14N experimentally and superlinear K ~ 0.3 N^1.2 in simulations.

  15. OpenG2G: A Simulation Platform for AI Datacenter-Grid Runtime Coordination

    cs.LG 2026-05 unverdicted novelty 6.0

    OpenG2G is a new extensible simulation platform that lets users implement and compare classic, optimization, and learning-based controllers for AI datacenter power flexibility coordinated with the grid.

  16. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  17. Are Large Language Models Economically Viable for Industry Deployment?

    cs.CL 2026-04 unverdicted novelty 6.0

    Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.

  18. TRON: Trainable, architecture-reconfigurable random optical neural networks

    physics.optics 2026-04 unverdicted novelty 6.0

    TRON demonstrates a trainable and reconfigurable optical neural network that combines multi-scattering media with DMD-based matrix multiplication and performs in-situ optimization plus neural architecture search on th...

  19. Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures

    cs.DC 2026-04 unverdicted novelty 6.0

    Watt Counts supplies over 5,000 energy measurements across 50 LLMs and 10 GPUs and shows that hardware-aware selection can reduce server-scenario energy use by up to 70 percent with little effect on user experience.

  20. SAM 2: Segment Anything in Images and Videos

    cs.CV 2024-08 conditional novelty 6.0

    SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation datas...

  21. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    cs.CL 2022-11 unverdicted novelty 6.0

    BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

  22. PaLM: Scaling Language Modeling with Pathways

    cs.CL 2022-04 accept novelty 6.0

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  23. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  24. LaMDA: Language Models for Dialog Applications

    cs.CL 2022-01 unverdicted novelty 6.0

    LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.

  25. Ethical and social risks of harm from Language Models

    cs.CL 2021-12 accept novelty 6.0

    The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...

  26. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

    cs.CE 2026-05 unverdicted novelty 5.0

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  27. UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

    cs.CL 2026-05 unverdicted novelty 5.0

    UniSD unifies complementary self-distillation mechanisms for autoregressive LLMs and achieves up to +5.4 point gains over base models and +2.8 over baselines across six benchmarks and six models.

  28. ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity

    cs.LG 2026-05 unverdicted novelty 5.0

    ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.

  29. Toward a Sustainable Software Architecture Community: Evaluating ICSA's Environmental Impact

    cs.SE 2026-04 unverdicted novelty 5.0

    The study provides exploratory estimates of carbon emissions from GenAI inference in ICSA papers and from the full operations of the ICSA 2025 conference.

  30. DINOv2: Learning Robust Visual Features without Supervision

    cs.CV 2023-04 unverdicted novelty 5.0

    Pith review generated a malformed one-line summary.

  31. From Cradle to Cloud: A Life Cycle Review of AI's Environmental Footprint

    cs.CY 2026-05 unverdicted novelty 4.0

    A review of AI sustainability studies finds inconsistent life cycle definitions and predominant reliance on coarse CO2e proxies, with limited coverage of water, materials, and multi-impact assessments.

  32. Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models

    cs.SE 2026-04 unverdicted novelty 4.0

    CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.

  33. AI-Native Autonomous Infrastructure (ANAI): A Formal Framework for the Next General-Purpose Technology

    eess.SY 2026-04 unverdicted novelty 4.0

    Introduces ANAI framework with Autonomy Index (AIx), Infrastructure Coupling Coefficient (ICC), and Technological Transition Potential (TTP) to model AI-driven infrastructural transition via nonlinear coevolution and ...

  34. minAction.net: Energy-First Neural Architecture Design -- From Biological Principles to Systematic Validation

    cs.LG 2026-04 conditional novelty 4.0

    Large-scale experiments show architecture performance depends on task type, not universality, and a single-parameter energy penalty reduces computational energy by ~1000x with negligible accuracy cost.

  35. SymptomWise: A Deterministic Reasoning Layer for Reliable and Efficient AI Systems

    cs.AI 2026-04 unverdicted novelty 4.0

    SymptomWise uses expert knowledge and deterministic rules for diagnosis after LLM-based symptom extraction, achieving 88% top-5 accuracy on 42 challenging pediatric neurology cases.

  36. Analytic Framework for Estimating Memory Cost

    cs.ET 2026-05 unverdicted novelty 3.0

    An analytic framework is introduced to estimate memory-related energy costs of AI models and quantify their ecological footprint.

  37. Unbox Responsible GeoAI: Navigating Climate Extreme and Disaster Mapping

    cs.CY 2026-05 unverdicted novelty 3.0

    Responsible GeoAI for disaster mapping requires governance across data, applications, and society rather than algorithm improvements alone.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · cited by 37 Pith papers

  1. [1]

    We use Google Georgia datacenter’s PUE from the period in which the search computation was run (1.10 in Table 4) instead of the US average in 2018 (1.58)

  2. [2]

    Strubell et al. used the US average CO2 per kilowatt hour (KWh) as calculated by the U.S. Environmental Protection Agency (EPA) of 0.423 kg per KWh in 2018. For Google, we use the Georgia datacenter’s average CO2e/KWh for the month when NAS was performed (0.431 CO2e/KWh in Table 4)

  3. [3]

    So et al. used Google TPU v2 accelerators, not NVIDIA P100 GPUs as modeled in [Str19]. TPU v2s are much faster, so the search process takes 32,633 TPU v2 hours instead of 117,780 P100 hours. We measured the power when running the [So19] NAS computation on TPU v2, including the memory, fans, network interfaces, and the CPU host. The average power was 208 W...