The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining
Pith reviewed 2026-05-09 17:53 UTC · model grok-4.3
The pith
For the Olmo 3 family, development costs (experimentation, failed runs, and ablations) account for 82.2% of total compute, and reasoning models require 17x more post-training energy than their instruction-tuned counterparts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For the Olmo 3 models, development costs including experimentation, failed runs, and ablations account for 82.2% of total compute, a roughly 65% increase over the approximately 50% reported for pretraining-focused pipelines. Reasoning models are 17 times more expensive to post-train than instruction-tuned counterparts, driven by reinforcement learning rollout generation. The entire process consumed approximately 12.3 GWh of datacenter energy, emitted 4,251 tCO2eq, and used 15,887 kL of water, with water consumption tied entirely to power generation infrastructure rather than direct data center cooling.
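As a rough consistency check on these totals, the conversion factors they imply can be back-calculated; the factors below are derived here from the reported numbers and are not stated by the paper.

```python
# Back-calculate the conversion factors implied by the reported totals
# (12.3 GWh, 4,251 tCO2eq, 15,887 kL). These factors are derived here
# for illustration; the paper does not state them directly.

energy_kwh = 12.3e6        # 12.3 GWh of datacenter energy, in kWh
carbon_g = 4_251e6         # 4,251 tCO2eq, in grams
water_l = 15_887e3         # 15,887 kL of water, in liters

carbon_intensity = carbon_g / energy_kwh   # gCO2eq per kWh
water_intensity = water_l / energy_kwh     # liters per kWh

print(f"implied grid carbon intensity: {carbon_intensity:.0f} gCO2eq/kWh")
print(f"implied water-use factor:      {water_intensity:.2f} L/kWh")
```

Both implied values fall in a plausible range for US grid electricity, which is a weak sanity check rather than a validation of the underlying measurements.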
What carries the argument
The stage-by-stage environmental accounting of the complete language model development pipeline, including pretraining, supervised fine-tuning, preference optimization, reinforcement learning, and all associated experimentation and failed runs.
If this is right
- Reasoning models incur substantially higher post-training energy costs due to reinforcement learning rollouts.
- Development overheads significantly increase the overall environmental footprint beyond pretraining alone.
- Environmental reporting standards must incorporate full pipeline costs to accurately reflect impacts.
- Unreported costs will likely grow rapidly as post-training pipelines become more complex.
- Efforts to reduce AI's environmental impact should address the entire development process rather than isolated stages.
Where Pith is reading between the lines
- Developers may need to adopt more efficient experimentation strategies to lower the share of failed runs.
- Disclosure of full development impacts could raise public awareness of AI's true environmental footprint.
- Optimization efforts might target reinforcement learning stages and trial-and-error workflows for greater efficiency gains.
Load-bearing premise
The energy, carbon, and water figures for each pipeline stage including reinforcement learning rollouts and failed experiments are complete and accurately measured without significant undercounting or reliance on unverified conversion factors.
What would settle it
An independent audit that directly measures the total datacenter energy and carbon emissions for a comparable full language model development pipeline and finds totals or proportions substantially different from the reported 12.3 GWh or 82.2% development overhead.
Figures
Original abstract
Modern language model development extends far beyond pretraining, yet environmental reporting remains narrowly focused on the cost of training a single final model. In this work, we provide the first detailed breakdown of the environmental impact of a full model development pipeline, from pretraining through supervised fine-tuning, preference optimization, and reinforcement learning, for Olmo 3, a family of 7 billion and 32 billion parameter models in both instruction-following and reasoning variants. We find that reasoning models are 17x more expensive to post-train than their instruction-tuned counterparts in terms of datacenter energy, driven by reinforcement learning rollout generation. Development costs (including experimentation, failed runs, and ablations) account for 82.2% of total compute, a roughly 65% increase over the ~50% reported for pretraining-focused pipelines in prior work. In total, we estimate our model development process consumed ~12.3 GWh of datacenter energy, emitted 4,251 tCO2eq, and consumed 15,887 kL of water, with water consumption driven entirely by power generation infrastructure rather than data center cooling. These costs, which are almost entirely unreported by model developers, are growing rapidly as post-training pipelines become more complex, and must be accounted for in environmental reporting standards and by the research community working to reduce AI's environmental impact.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper provides the first detailed environmental impact analysis of a complete LM development pipeline for the Olmo 3 family (7B and 32B parameter models in instruction-tuned and reasoning variants), covering pretraining through supervised fine-tuning, preference optimization, and reinforcement learning. It reports that development costs (experimentation, failed runs, ablations) comprise 82.2% of total compute (65% above prior pretraining-focused estimates), that reasoning models require 17x more post-training datacenter energy than instruction-tuned counterparts due to RL rollout generation, and that the full process consumed approximately 12.3 GWh of energy, emitted 4,251 tCO2eq, and used 15,887 kL of water (driven by power generation rather than cooling).
Significance. If the underlying measurements hold, the work is significant for demonstrating that post-training and hidden development overheads dominate the environmental footprint of modern LM pipelines far beyond pretraining, providing concrete numbers for a real model family that prior studies lacked. It strengthens the case for expanded reporting standards and could guide efficiency research, with credit due for the comprehensive stage-by-stage breakdown and inclusion of water metrics alongside energy and emissions.
major comments (3)
- [Abstract] The headline figures (82.2% development share, 17x reasoning multiplier, 12.3 GWh / 4,251 tCO2eq / 15,887 kL totals) are presented without error bars, sensitivity ranges, or explicit sources for the energy-to-carbon and energy-to-water conversion factors, which undermines verification of the central quantitative claims.
- [Methods] The accounting for development costs (including all failed runs and ablations) is described only at a high level, with no specifics on cluster logging completeness, post-hoc estimation procedures for unlogged experiments, or how RL rollout energy was isolated and scaled, even though these details directly support the 82.2% and 17x claims.
- [Results] The 17x energy multiplier for reasoning vs. instruction-tuned models is attributed to RL rollouts, yet no details are given on rollout counts, sampling volumes, or per-rollout measurement methodology, making it impossible to assess whether the factor is robust or generalizes beyond this pipeline.
minor comments (2)
- [Introduction] The reference to '~50% reported for pretraining-focused pipelines in prior work' would benefit from explicit citations to those studies for direct comparison.
- [Results] Figure or table presenting the stage-by-stage breakdown should include units and conversion assumptions explicitly in the caption for clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the significance of providing a full-pipeline environmental analysis. The comments correctly identify areas where greater methodological transparency is needed. We address each point below and will revise the manuscript to incorporate the requested details.
Point-by-point responses
- Referee: [Abstract] The headline figures (82.2% development share, 17x reasoning multiplier, 12.3 GWh / 4,251 tCO2eq / 15,887 kL totals) are presented without error bars, sensitivity ranges, or explicit sources for the energy-to-carbon and energy-to-water conversion factors, which undermines verification of the central quantitative claims.
Authors: We agree that the absence of uncertainty quantification and explicit conversion-factor sources limits verifiability. In the revised manuscript we will (i) cite the precise regional grid-intensity and water-use factors used (with references to the underlying EPA and utility data), (ii) add a sensitivity table showing how totals vary under ±20% changes in carbon intensity and water-use rates, and (iii) report approximate uncertainty ranges derived from hardware-utilization variance and logging gaps. These additions will appear in both the Results section and a new subsection of the Methods; the abstract will be updated to reference the sensitivity analysis. revision: yes
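A minimal sketch of the promised sensitivity table, assuming the baseline conversion factors are those implied by the paper's reported totals; the ±20% band is the range named in the response, not a measured uncertainty.

```python
# Sketch of the promised sensitivity table: headline totals under
# +/-20% shifts in the conversion factors. Baseline factors are the
# ones implied by the reported totals; the +/-20% band is the range
# named in the rebuttal, not a measured uncertainty.

ENERGY_KWH = 12.3e6                     # total datacenter energy, kWh
BASE_CARBON = 4_251e6 / ENERGY_KWH      # implied gCO2eq/kWh
BASE_WATER = 15_887e3 / ENERGY_KWH      # implied L/kWh

for delta in (-0.20, 0.0, 0.20):
    carbon_t = ENERGY_KWH * BASE_CARBON * (1 + delta) / 1e6  # tCO2eq
    water_kl = ENERGY_KWH * BASE_WATER * (1 + delta) / 1e3   # kL
    print(f"{delta:+.0%}: {carbon_t:8,.0f} tCO2eq  {water_kl:8,.0f} kL")
```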
- Referee: [Methods] The accounting for development costs (including all failed runs and ablations) is described only at a high level, with no specifics on cluster logging completeness, post-hoc estimation procedures for unlogged experiments, or how RL rollout energy was isolated and scaled, even though these details directly support the 82.2% and 17x claims.
Authors: We acknowledge that the current description is insufficient for independent assessment. The revision will expand the Methods section with: (a) the exact coverage of the cluster logging system (percentage of jobs captured by the power-monitoring daemon), (b) the post-hoc estimation protocol for unlogged runs (linear regression on GPU-hours and model size calibrated against logged counterparts), and (c) the precise procedure used to isolate RL rollout energy (per-token power draw measured on the inference cluster, multiplied by total tokens generated across all rollouts). These details will directly underpin the 82.2% and 17x figures. revision: yes
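The rollout-energy isolation procedure described in (c) can be sketched as follows; the per-token energy and token count plugged in below are hypothetical placeholders, not the paper's measurements.

```python
# Minimal sketch of the rollout-energy isolation procedure: a per-token
# energy figure measured on the inference cluster, scaled by the total
# tokens generated across all RL rollouts. The numbers plugged in below
# are hypothetical placeholders, not the paper's data.

def rollout_energy_kwh(joules_per_token: float, total_tokens: int) -> float:
    """Estimate total RL rollout energy from a per-token measurement."""
    return joules_per_token * total_tokens / 3.6e6  # 1 kWh = 3.6e6 J

# Hypothetical example: 0.5 J per generated token over 1e12 tokens.
energy = rollout_energy_kwh(0.5, 10**12)
print(f"{energy:,.0f} kWh")  # ~139,000 kWh under these made-up inputs
```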
- Referee: [Results] The 17x energy multiplier for reasoning vs. instruction-tuned models is attributed to RL rollouts, yet no details are given on rollout counts, sampling volumes, or per-rollout measurement methodology, making it impossible to assess whether the factor is robust or generalizes beyond this pipeline.
Authors: We agree that the 17x claim requires supporting quantitative detail. The revised Results section will report: the total number of RL rollouts performed for each model variant, the average number of sampled responses per prompt, and the per-rollout energy measurement method (direct power-meter readings on the inference nodes, averaged over representative batches). We will also add a short discussion of how these volumes compare with other published RL post-training pipelines, while noting that the exact multiplier is pipeline-specific. revision: yes
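To illustrate why the multiplier is pipeline-specific, a toy decomposition, assuming (purely for illustration) that both variants share the SFT and preference-optimization stages and only the reasoning variant pays for RL rollouts; all stage energies below are made up.

```python
# Toy decomposition of the post-training multiplier, assuming (purely
# for illustration) that both variants share the SFT and preference
# stages and only the reasoning variant pays for RL rollouts. All
# stage energies below are made-up placeholders, not the paper's data.

def post_train_multiplier(sft_kwh: float, pref_kwh: float,
                          rl_rollout_kwh: float) -> float:
    """Reasoning-variant post-training energy over instruct-variant energy."""
    instruct = sft_kwh + pref_kwh
    reasoning = instruct + rl_rollout_kwh
    return reasoning / instruct

# Rollouts at 16x the shared stages give a 17x multiplier; halving the
# rollout volume drops it to 9x, which is why the factor does not
# transfer directly to other pipelines.
print(post_train_multiplier(60.0, 40.0, 1600.0))  # 17.0
print(post_train_multiplier(60.0, 40.0, 800.0))   # 9.0
```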
Circularity Check
No circularity: claims rest on direct empirical measurements of pipeline energy use.
Full rationale
The paper reports observed totals and breakdowns (12.3 GWh, 82.2% development share, 17x post-training gap) obtained by instrumenting their own Olmo 3 training runs, including failed experiments, ablations, SFT, preference optimization, and RL rollouts. These quantities are presented as measured outputs rather than derived via equations, fitted parameters, or self-referential definitions that would make the reported figures equivalent to the inputs by construction. External comparisons (e.g., to the ~50% figure in prior pretraining-focused work) cite independent literature and do not rely on load-bearing self-citations or uniqueness theorems from the authors' prior papers. The derivation chain is therefore a straightforward accounting exercise with no reduction of predictions to fitted inputs or ansatzes smuggled through citations.
Axiom & Free-Parameter Ledger
free parameters (2)
- energy per RL rollout
- development overhead fraction
axioms (1)
- domain assumption: Datacenter energy measurements and carbon/water conversion factors from power infrastructure are accurate and complete.